1 Introduction

We study the algorithmic problem of approximately finding all of the eigenvalues and eigenvectors of a given arbitrary \(n\times n\) complex matrix. While this problem is quite well-understood in the special case of Hermitian matrices (see, e.g., [52]), the general non-Hermitian case has remained mysterious from a theoretical standpoint even after several decades of research. In particular, the currently best-known provable algorithms for this problem run in time \(O(n^{10}/\delta ^2)\) [2] or \(O(n^c\log (1/\delta ))\) [17] with \(c\ge 12\), where \(\delta >0\) is the desired accuracy, depending on the model of computation and notion of approximation considered (Footnote 1). To be sure, the non-Hermitian case is well-motivated: coupled systems of differential equations, linear dynamical systems in control theory, transfer operators in mathematical physics, and the nonbacktracking matrix in spectral graph theory are but a few situations where finding the eigenvalues and eigenvectors of a non-Hermitian matrix is important.

The key difficulties in dealing with non-normal matrices are the interrelated phenomena of non-orthogonal eigenvectors and spectral instability, the latter referring to extreme sensitivity of the eigenvalues and invariant subspaces to perturbations of the matrix. Non-orthogonality slows down convergence of standard algorithms such as the power method, and spectral instability can force the use of very high precision arithmetic, also leading to slower algorithms. Both phenomena together make it difficult to reduce the eigenproblem to a subproblem by “removing” an eigenvector or invariant subspace, since this can only be done approximately and one must control the spectral stability of the subproblem in order to be able to rigorously reason about it.

In this paper, we overcome these difficulties by identifying and leveraging a phenomenon we refer to as pseudospectral shattering: adding a small complex Gaussian perturbation to any matrix typically yields a matrix with well-conditioned eigenvectors and a large minimum gap between the eigenvalues, implying spectral stability. Previously, even the existence of such a regularizing perturbation with favorable parameters was not known [20]. This result builds on the recent solution of Davies’ conjecture [9] and is of independent interest in random matrix theory, where minimum eigenvalue gap bounds in the non-Hermitian case were previously only known for i.i.d. models [33, 55].

We complement the above by proving that a variant of the well-known spectral bisection algorithm in numerical linear algebra [11] is both fast and numerically stable when run on a pseudospectrally shattered matrix—we call an iterative algorithm numerically stable if it can be implemented using finite precision arithmetic with polylogarithmically many bits, corresponding to a dynamical system whose trajectory to the approximate solution is robust to adversarial noise (see, e.g. [57]). The key step in the bisection algorithm is computing the sign function of a matrix, a problem of independent interest in many areas such as control theory and approximation theory [44]. Our main algorithmic contribution is a rigorous analysis of the well-known Newton iteration method [53] for computing the sign function in finite arithmetic, showing that it converges quickly and numerically stably on matrices for which the sign function is well-conditioned, in particular on pseudospectrally shattered ones.

The end result is an algorithm which reduces the general diagonalization problem to a polylogarithmic (in the desired accuracy and dimension n) number of invocations of standard numerical linear algebra routines (multiplication, inversion, and QR factorization), each of which is reducible to matrix multiplication [22], yielding a nearly matrix multiplication runtime for the whole algorithm. This improves on the previously best-known running time of \(O(n^3+n^2\log (1/\delta ))\) arithmetic operations even in the Hermitian case ([21], see also [41, 52]), and yields the same improvement for the related problem of computing the singular value decomposition of a matrix.

We now proceed to give precise mathematical formulations of the eigenproblem and computational model, followed by statements of our results and a detailed discussion of related work.

1.1 Problem Statement

An eigenpair of a matrix \(A\in \mathbb {C}^{n\times n}\) is a tuple \((\lambda , v)\in \mathbb {C}\times \mathbb {C}^n\) such that

$$\begin{aligned} Av=\lambda v, \end{aligned}$$

and v is normalized to be a unit vector. The eigenproblem is the problem of finding a maximal set of linearly independent eigenpairs \((\lambda _i,v_i)\) of a given matrix A; note that an eigenvalue may appear more than once if it has geometric multiplicity greater than one. In the case when A is diagonalizable, the solution consists of exactly n eigenpairs, and if A has distinct eigenvalues then the solution is unique, up to the phases of the \(v_i\).

1.1.1 Accuracy and Conditioning

Due to the Abel–Ruffini theorem, it is impossible to have a finite-time algorithm which solves the eigenproblem exactly using arithmetic operations and radicals. Thus, all we can hope for is approximate eigenvalues and eigenvectors, up to a desired accuracy \(\delta >0\). There are two standard notions of approximation. We assume \(\Vert A\Vert \le 1\) for normalization, where throughout this work, \(\Vert \cdot \Vert \) denotes the spectral norm (the \(\ell ^2 \rightarrow \ell ^2\) operator norm).

Forward Approximation. Compute pairs \((\lambda _i',v_i')\) such that

$$\begin{aligned} |\lambda _i-\lambda _i'|\le \delta \quad \text {and}\quad \Vert v_i-v_i'\Vert \le \delta \end{aligned}$$

for the true eigenpairs \((\lambda _i,v_i)\), i.e., find a solution close to the exact solution. This makes sense in contexts where the exact solution is meaningful, e.g., the matrix is of theoretical/mathematical origin, and unstable (in the entries) quantities such as eigenvalue multiplicity can have a significant meaning.

Backward Approximation. Compute \((\lambda _i',v_i')\) which are the exact eigenpairs of a matrix \(A'\) satisfying

$$\begin{aligned} \Vert A'-A\Vert \le \delta , \end{aligned}$$

i.e., find the exact solution to a nearby problem. This is the appropriate and standard notion in scientific computing, where the matrix is of physical or empirical origin and is not assumed to be known exactly (and even if it were, roundoff error would destroy this exactness). Note that since diagonalizable matrices are dense in \(\mathbb {C}^{n\times n}\), one can hope to always find a complete set of eigenpairs for some nearby \(A'=VDV^{-1}\), yielding an approximate diagonalization of A:

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \delta . \end{aligned}$$
(1)
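To make (1) concrete: a Jordan block is not diagonalizable, but an arbitrarily small perturbation of it is. The following minimal sketch (ours, assuming numpy; it is an illustration, not the algorithm of this paper) exhibits such an approximate diagonalization and the attendant growth of \(\kappa (V)\).

```python
import numpy as np

# A 4x4 nilpotent Jordan block: not diagonalizable.
n, delta = 4, 1e-6
J = np.diag(np.ones(n - 1), k=1)

# Perturbing the bottom-left entry gives a matrix A' with n distinct
# eigenvalues (the n-th roots of delta), at distance delta from J.
A_prime = J.copy()
A_prime[-1, 0] = delta

evals, V = np.linalg.eig(A_prime)
D = np.diag(evals)

# An exact diagonalization of A', hence a delta-backward approximate
# diagonalization of J in the sense of (1).
print(np.linalg.norm(J - V @ D @ np.linalg.inv(V), 2))  # ~ delta
print(np.linalg.cond(V))  # kappa(V) blows up as delta -> 0
```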

Note that the eigenproblem in either of the above formulations is not easily reducible to the problem of computing eigenvalues, since they can only be computed approximately and it is not clear how to obtain approximate eigenvectors from approximate eigenvalues. We now introduce a condition number for the eigenproblem, which measures the sensitivity of the eigenpairs of a matrix to perturbations and allows us to relate its forward and backward approximate solutions.

Condition Numbers. For diagonalizable A, the eigenvector condition number of A, denoted \(\kappa _V(A)\), is defined as:

$$\begin{aligned} \kappa _V(A):=\inf _{V} \Vert V\Vert \Vert V^{-1}\Vert , \end{aligned}$$
(2)

where the infimum is over all invertible V such that \(A=VDV^{-1}\) for some diagonal D, and its minimum eigenvalue gap is defined as:

$$\begin{aligned} \mathrm {gap}(A):=\min _{i\ne j}|\lambda _i(A)-\lambda _j(A)|, \end{aligned}$$

where \(\lambda _i\) are the eigenvalues of A (with multiplicity).

We define the condition number of the eigenproblem to be (Footnote 2):

$$\begin{aligned} \kappa _{\mathrm {eig}}(A):=\frac{\kappa _V(A)}{\mathrm {gap}(A)}\in [0,\infty ]. \end{aligned}$$
(3)
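In floating point practice, \(\kappa _{\mathrm {eig}}\) can be estimated from a single computed diagonalization, as in the sketch below (ours, assuming numpy). Note that numpy returns one particular eigenvector matrix V with unit columns, so \(\kappa (V)\) only upper-bounds the infimum in (2), and the resulting quantity upper-bounds \(\kappa _{\mathrm {eig}}(A)\).

```python
import numpy as np
from itertools import combinations

def kappa_eig_upper_bound(A):
    """Upper bound on kappa_eig(A) = kappa_V(A) / gap(A) from one
    computed diagonalization: the gap is computed from the computed
    eigenvalues, while cond(V) upper-bounds the infimum kappa_V(A)."""
    evals, V = np.linalg.eig(A)
    kappa_V = np.linalg.cond(V)  # ||V|| ||V^{-1}|| for this choice of V
    gap = min(abs(a - b) for a, b in combinations(evals, 2))
    return kappa_V / gap

rng = np.random.default_rng(0)
n = 50
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
print(kappa_eig_upper_bound(G))  # modest for a Ginibre matrix
```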

It follows from the proposition below (whose proof appears in Sect. 2.2) that a \(\delta \)-backward approximate solution of the eigenproblem is a \(6n\kappa _{\mathrm {eig}}(A)\delta \)-forward approximate solution (Footnote 3).

Proposition 1.1

If \(\Vert A\Vert ,\Vert A'\Vert \le 1\), \(\Vert A-A'\Vert \le \delta \), and \(\{(v_i,\lambda _i)\}_{i\le n}\), \(\{(v_i',\lambda _i')\}_{i\le n}\) are eigenpairs of \(A,A'\) with distinct eigenvalues, and \(\delta < \frac{\mathrm {gap}(A)}{8 \kappa _V(A)}\), then

$$\begin{aligned} \Vert v_i'-v_i\Vert \le 2n\kappa _{\mathrm {eig}}(A) \delta \quad \text { and }\quad |\lambda _i'-\lambda _i|\le \kappa _V(A) \delta \le 2\kappa _{\mathrm {eig}}(A) \delta \quad \forall i=1,\ldots ,n, \end{aligned}$$
(4)

after possibly multiplying the \(v_i\) by phases.

Note that \(\kappa _{\mathrm {eig}}=\infty \) if and only if A has a double eigenvalue; in this case, a relation like (4) is not possible since different infinitesimal changes to A can produce macroscopically different eigenpairs.

In this paper we will present a backward approximation algorithm for the eigenproblem with running time scaling polynomially in \(\log (1/\delta )\), which by (4) yields a forward approximation algorithm with running time scaling polynomially in \(\log (\kappa _{\mathrm {eig}}/\delta )\).

Remark 1.2

(Multiple Eigenvalues) A backward approximation algorithm for the eigenproblem can be used to accurately find bases for the eigenspaces of matrices with multiple eigenvalues, but quantifying the forward error requires introducing condition numbers for invariant subspaces rather than eigenpairs. A standard treatment of this can be found in any numerical linear algebra textbook, e.g. [26], and we do not discuss it further in this paper for simplicity of exposition.

1.1.2 Models of Computation

These questions may be studied in various computational models: exact real arithmetic (i.e., infinite precision), variable precision rational arithmetic (rationals are stored exactly as numerators and denominators), and finite precision arithmetic (real numbers are rounded to a fixed number of bits which may depend on the input size and accuracy). Only the last two models yield actual Boolean complexity bounds, but introduce a second source of error stemming from the fact that computers cannot exactly represent real numbers.

We study the third model in this paper, axiomatized as follows.

Finite Precision Arithmetic. We use the standard floating point axioms from [39]. Numbers are stored and manipulated approximately up to some machine precision \({\textbf {u }}:={\textbf {u }}(\delta ,n)>0\), which for us will depend on the instance size n and desired accuracy \(\delta \). This means every number \(x\in \mathbb {C}\) is stored as \(\mathsf {fl}(x)=(1+\Delta )x\) for some adversarially chosen \(\Delta \in \mathbb {C}\) satisfying \(|\Delta |\le {\textbf {u }}\), and each arithmetic operation \(\circ \in \{+,-,\times ,\div \}\) is guaranteed to yield an output satisfying

$$\begin{aligned} \mathsf {fl}(x\circ y) = (x\circ y)(1+\Delta )\quad |\Delta |\le {\textbf {u }}. \end{aligned}$$

It is also standard and convenient to assume that we can evaluate \(\sqrt{x}\) for any \(x\in \mathbb {R}\), where again \(\mathsf {fl}(\sqrt{x}) = \sqrt{x} (1 + \Delta )\) for \(|\Delta | \le {\textbf {u }}\).

Thus, the outcomes of all operations are adversarially noisy due to roundoff. The bit lengths of numbers stored in this form remain fixed at \(\lg (1/{\textbf {u }})\), where \(\lg \) denotes the logarithm base 2. The bit complexity of an algorithm is therefore the number of arithmetic operations times \(O^*(\log (1/{\textbf {u }}))\), the running time of standard floating point arithmetic, where the \(*\) suppresses \(\log \log (1/{\textbf {u }})\) factors. We will state all running times in terms of arithmetic operations accompanied by the required number of bits of precision, which thereby immediately imply bit complexity bounds.
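For instance, in IEEE double precision one has \({\textbf {u }}=2^{-53}\), and the familiar discrepancies of floating point arithmetic are exactly the adversarial \(\Delta \)'s of this model, as the following toy check (ours, assuming numpy) illustrates.

```python
import numpy as np

u = 2.0 ** -53  # unit roundoff of IEEE double precision

# 0.1, 0.2, 0.3 are each stored as fl(x) = x(1 + Delta), and the sum
# incurs one more Delta, so the relative error is a small multiple of u.
s = 0.1 + 0.2
print(s == 0.3)                           # False
print(abs(s - 0.3) / 0.3 <= 4 * u)        # True: a handful of rounding steps
print(np.finfo(np.float64).eps / 2 == u)  # numpy's eps is 2u
```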

Remark 1.3

(Overflow, Underflow, and Additive Error) Using p bits for the exponent in the floating-point representation allows one to represent numbers with magnitude in the range \([2^{-2^p},2^{2^p}]\). It can be easily checked that all of the nonzero numbers, norms, and condition numbers appearing during the execution of our algorithms lie in the range \([2^{-\lg ^c(n/\delta )},2^{\lg ^c(n/\delta )}]\) for some small c, so overflow and underflow do not occur. In fact, we could have analyzed our algorithm in a computational model where every number is simply rounded to the nearest rational with denominator \(2^{\lg ^c(n/\delta )}\)—corresponding to additive arithmetic errors. We have chosen to use the multiplicative error floating point model since it is the standard in numerical analysis, but our algorithms do not exploit any subtleties arising from the difference between the two models.

The advantages of the floating point model are that it is realistic and potentially yields very fast algorithms by using a small number of bits of precision (polylogarithmic in n and \(1/\delta \)), in contrast to rational arithmetic, where even a simple operation such as inverting an \(n\times n\) integer matrix requires n extra bits of precision (see, e.g., Chapter 1 of [35]). An iterative algorithm that can be implemented in finite precision (typically, with a number of bits polylogarithmic in the input size and desired accuracy) is called numerically stable.

The disadvantage of the model is that it is only possible to compute forward approximations of quantities which are well-conditioned in the input—in particular, discontinuous quantities such as eigenvalue multiplicity cannot be computed in the floating point model, since it is not even assumed that the input is stored exactly.

1.2 Results and Techniques

In addition to \(\kappa _{\mathrm {eig}}\), we will need some more refined quantities to measure the stability of the eigenvalues and eigenvectors of a matrix to perturbations, and to state our results. The most important of these is the \(\epsilon \)-pseudospectrum, defined for any \(\epsilon >0\) and \(M\in \mathbb {C}^{n\times n}\) as:

$$\begin{aligned} \Lambda _\epsilon (M)&:= \left\{ \lambda \in \mathbb {C}: \lambda \in \Lambda (M + E) \text { for some }\Vert E\Vert < \epsilon \right\} \end{aligned}$$
(5)
$$\begin{aligned}&= \left\{ \lambda \in \mathbb {C}: \left\| (\lambda - M)^{-1}\right\| > 1/\epsilon \right\} \end{aligned}$$
(6)

where \(\Lambda (\cdot )\) denotes the spectrum of a matrix. The equivalence of (5) and (6) is simple and can be found in the excellent book [62].
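Characterization (6) also suggests a brute-force way of computing pseudospectra: \(z\in \Lambda _\epsilon (M)\) precisely when the smallest singular value of \(z-M\) is below \(\epsilon \). A sketch of this (ours, assuming numpy; see [62] for serious algorithms):

```python
import numpy as np

def in_pseudospectrum(M, grid, eps):
    """Mark grid points z with sigma_min(z - M) < eps, i.e., z in
    Lambda_eps(M) by (6). Brute force: one SVD per grid point."""
    I = np.eye(M.shape[0])
    smin = np.array([np.linalg.svd(z * I - M, compute_uv=False)[-1]
                     for z in grid.ravel()]).reshape(grid.shape)
    return smin < eps

# A 20x20 Jordan block has spectrum {0}, but Lambda_eps fills a disk of
# radius roughly eps^(1/n), which is large even for tiny eps.
n = 20
J = np.diag(np.ones(n - 1), k=1)
xs = np.linspace(-1.5, 1.5, 61)
grid = xs[None, :] + 1j * xs[:, None]
print(in_pseudospectrum(J, grid, eps=1e-8).mean())  # a visible fraction
```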

Eigenvalue Gaps, \(\kappa _V\), and Pseudospectral Shattering. The key probabilistic result of the paper is that a random complex Gaussian perturbation of any matrix yields a nearby matrix with large minimum eigenvalue gap and small \(\kappa _V\).

Theorem 1.4

(Smoothed Analysis of \(\mathrm {gap}\) and \(\kappa _V\)) Suppose \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\), and \(\gamma \in (0,1/2)\). Let \(G_n\) be an \(n\times n\) matrix with i.i.d. complex Gaussian \(N(0,1_\mathbb {C}/n)\) entries, and let \(X:=A+\gamma G_n\). Then

$$\begin{aligned} \kappa _V(X)\le \frac{n^2}{\gamma }, \quad \mathrm {gap}(X)\ge \frac{\gamma ^4}{n^5}, \quad \text {and}\quad \Vert G_n\Vert \le 4, \end{aligned}$$

with probability at least \(1-12/n\).

The proof of Theorem 1.4 appears in Sect. 3.1. The key idea is to first control \(\kappa _V(X)\) using [9] and then observe that for a matrix with small \(\kappa _V\), two eigenvalues of X near a complex number z imply a small second-least singular value of \(z-X\), which we are able to control.
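Theorem 1.4 is easy to probe numerically. The sketch below (ours, assuming numpy; parameters chosen arbitrarily) perturbs a worst-case input, a nilpotent Jordan block, for which \(\kappa _V=\infty \) and \(\mathrm {gap}=0\), and compares the observed quantities to the bounds of the theorem; on typical draws both hold by a wide margin.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, gamma = 100, 0.01

A = np.diag(np.ones(n - 1), k=1)  # Jordan block: kappa_V infinite, gap 0
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G

evals, V = np.linalg.eig(X)
gap = min(abs(a - b) for a, b in combinations(evals, 2))
print(np.linalg.cond(V), "vs bound", n**2 / gamma)  # cond(V) >= kappa_V(X)
print(gap, "vs bound", gamma**4 / n**5)
```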

In Sect. 3.2 we develop the notion of pseudospectral shattering, which is implied by Theorem 1.4 and says roughly that the pseudospectrum consists of n components that lie in separate squares of an appropriately coarse grid in the complex plane. This is useful in the analysis of the spectral bisection algorithm in Sect. 5.

Matrix Sign Function. The sign function of a number \(z\in \mathbb {C}\) with \({\text {Re}}(z)\ne 0\) is defined as \(+1\) if \({\text {Re}}(z)>0\) and \(-1\) if \({\text {Re}}(z)<0\). The matrix sign function of a matrix A with Jordan normal form

$$\begin{aligned} A = V\begin{bmatrix} N & \\ & P\end{bmatrix}V^{-1}, \end{aligned}$$

where N (resp. P) has eigenvalues with strictly negative (resp. positive) real part, is defined as

$$\begin{aligned} \mathrm {sgn}(A) = V\begin{bmatrix} -I_N & \\ & I_P\end{bmatrix} V^{-1}, \end{aligned}$$

where \(I_P\) denotes the identity of the same size as P. The sign function is undefined for matrices with eigenvalues on the imaginary axis. Quantifying this discontinuity, Bai and Demmel [4] defined the following condition number for the sign function:

$$\begin{aligned} \kappa _{\mathrm {sign}}(M) := \inf \left\{ 1/\epsilon ^2 : \Lambda _\epsilon (M) \text { does not intersect the imaginary axis} \right\} , \end{aligned}$$
(7)

and gave perturbation bounds for \(\mathrm {sgn}(M)\) depending on \(\kappa _{\mathrm {sign}}\).
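By (7), estimating \(\kappa _{\mathrm {sign}}(M)\) amounts to finding the largest \(\epsilon \) for which \(\Lambda _\epsilon (M)\) avoids the imaginary axis, i.e., \(\inf _{y\in \mathbb {R}}\sigma _n(iy-M)\). A sketch (ours, assuming numpy; the grid search can overestimate this infimum and hence underestimate \(\kappa _{\mathrm {sign}}\)):

```python
import numpy as np

def kappa_sign_lower_bound(M, ys):
    """Estimate kappa_sign(M) = 1/eps^2, where eps is the infimum of
    sigma_min(iy - M) over real y; minimizing over a finite grid of y
    only yields a lower bound on kappa_sign."""
    I = np.eye(M.shape[0])
    eps = min(np.linalg.svd(1j * y * I - M, compute_uv=False)[-1] for y in ys)
    return 1.0 / eps**2

M = np.array([[1.0, 100.0], [0.0, -1.0]])  # nonnormal, eigenvalues +1, -1
print(kappa_sign_lower_bound(M, np.linspace(-5.0, 5.0, 2001)))  # ~ 1e4
```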

Roberts [53] showed that the simple iteration

$$\begin{aligned} A_{k+1}=\frac{A_k+A_k^{-1}}{2} \end{aligned}$$
(8)

converges globally and quadratically to \(\mathrm {sgn}(A)\) in exact arithmetic, but his proof relies on the fact that all iterates of the algorithm are simultaneously diagonalizable, a property which is destroyed in finite arithmetic since inversions can only be done approximately (Footnote 4). In Sect. 4 we show that this iteration is indeed convergent when implemented in finite arithmetic for matrices with small \(\kappa _{\mathrm {sign}}\), given a numerically stable matrix inversion algorithm. This leads to the following result:

Theorem 1.5

(Sign Function Algorithm) There is a deterministic algorithm \(\mathsf {SGN}\) which on input an \(n \times n\) matrix A with \(\Vert A\Vert \le 1\), a number K with \(K \ge \kappa _{\mathrm {sign}}(A)\), and a desired accuracy \(\beta \in (0, 1/12)\), outputs an approximation \(\mathsf {SGN}(A)\) with

$$\begin{aligned} \Vert \mathsf {SGN}(A)-\mathrm {sgn}(A)\Vert \le \beta , \end{aligned}$$

in

$$\begin{aligned} O( (\log K + \log \log (1/\beta )) T_{\mathsf {INV}}(n) ) \end{aligned}$$
(9)

arithmetic operations on a floating point machine with

$$\begin{aligned} \lg (1/{\textbf {u }}) = O(\log n \log ^3 K (\log K + \log (1/\beta ))) \end{aligned}$$

bits of precision, where \(T_\mathsf {INV}(n)\) denotes the number of arithmetic operations used by a numerically stable matrix inversion algorithm (satisfying Definition 2.7).

The main new idea in the proof of Theorem 1.5 is to control the evolution of the pseudospectra \(\Lambda _{\epsilon _k}(A_k)\) of the iterates with appropriately decreasing (in k) parameters \(\epsilon _k\), using a sequence of carefully chosen shrinking contour integrals in the complex plane. The pseudospectrum provides a richer induction hypothesis than scalar quantities such as condition numbers, and allows one to control all quantities of interest using the holomorphic functional calculus. This technique is introduced in Sects. 4.1 and 4.2, and carried out in finite arithmetic in Sect. 4.3, yielding Theorem 1.5.
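For orientation, here is what iteration (8) looks like in ordinary double precision (a bare sketch, ours, assuming numpy; it has none of the scaling or precision safeguards that the analysis behind Theorem 1.5 concerns, and it will behave badly when \(\kappa _{\mathrm {sign}}\) is large).

```python
import numpy as np

def newton_sign(A, iters=50, tol=1e-12):
    """Roberts' Newton iteration (8): A_{k+1} = (A_k + A_k^{-1}) / 2.
    Converges quadratically to sgn(A) when no eigenvalue of A is on or
    near the imaginary axis."""
    X = A.astype(complex)
    for _ in range(iters):
        X_next = (X + np.linalg.inv(X)) / 2
        if np.linalg.norm(X_next - X, 2) <= tol * np.linalg.norm(X, 2):
            return X_next
        X = X_next
    return X

rng = np.random.default_rng(2)
A = np.diag([1.0, 2.0, -1.0, -3.0]) + 0.1 * rng.standard_normal((4, 4))
S = newton_sign(A)
print(np.linalg.norm(S @ S - np.eye(4), 2))  # sgn(A)^2 = I
print(np.linalg.norm(S @ A - A @ S, 2))      # sgn(A) commutes with A
```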

Diagonalization by Spectral Bisection. Given an algorithm for computing the sign function, there is a natural and well-known approach to the eigenproblem pioneered in [11]. The idea is that the matrices \((I\pm \mathrm {sgn}(A))/2\) are spectral projectors onto the invariant subspaces corresponding to the eigenvalues of A in the left and right open half planes, so if some shifted matrix \(z + A\) or \(z + iA\) has roughly half its eigenvalues in each half plane, the problem can be reduced to smaller subproblems appropriate for recursion.

The two difficulties in carrying out the above approach are: (a) efficiently computing the sign function, and (b) finding a balanced splitting along an axis that is well-separated from the spectrum. These are nontrivial even in exact arithmetic, since the iteration (8) converges slowly if (b) is not satisfied, even without roundoff error. We use Theorem 1.4 to ensure that a good splitting always exists after a small Gaussian perturbation of order \(\delta \), and Theorem 1.5 to compute splittings efficiently in finite precision. Combining this with well-understood techniques such as rank-revealing QR factorization, we obtain the following theorem, whose proof appears in Sect. 5.1.
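In code, one step of this reduction is quite short. The sketch below (ours, assuming numpy and scipy, with scipy's column-pivoted QR standing in for the rank-revealing QR factorization mentioned above; shift selection and all finite-precision issues are omitted) splits a small test matrix into its two half-plane subproblems.

```python
import numpy as np
from scipy.linalg import qr  # column-pivoted QR, a rank-revealing proxy

def matrix_sign(A, iters=60):
    # Roberts' Newton iteration (8), as in the sketch above.
    X = A.astype(complex)
    for _ in range(iters):
        X = (X + np.linalg.inv(X)) / 2
    return X

def split_spectrum(A):
    """One bisection step. P = (I + sgn(A))/2 projects onto the invariant
    subspace for eigenvalues in the right half plane; if the first r
    columns of Q form an orthonormal basis of its range, then Q* A Q is
    block upper triangular and its diagonal blocks are the subproblems."""
    n = A.shape[0]
    P = (np.eye(n) + matrix_sign(A)) / 2
    r = int(round(np.trace(P).real))  # number of eigenvalues in the RHP
    Q, _, _ = qr(P, pivoting=True)    # first r columns of Q span range(P)
    B = Q.conj().T @ A @ Q
    return B[:r, :r], B[r:, r:]

A = np.diag([1.0, 2.0, -1.0, -3.0]) \
    + 0.05 * np.random.default_rng(3).standard_normal((4, 4))
right, left = split_spectrum(A)
print(np.linalg.eigvals(right))  # ~ {1, 2}
print(np.linalg.eigvals(left))   # ~ {-1, -3}
```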

Theorem 1.6

(Backward Approximation Algorithm) There is a randomized algorithm \(\mathsf {EIG}\) which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and a desired accuracy parameter \(\delta >0\) outputs a diagonal D and invertible V such that

$$\begin{aligned} \quad \Vert A-VDV^{-1}\Vert \le \delta \quad \mathrm {and}\quad \kappa (V) \le 32n^{2.5}/\delta \end{aligned}$$

in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O(\log ^4(n/\delta )\log n) \end{aligned}$$

bits of precision, with probability at least \(1-14/n\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Since there is a correspondence in terms of the condition number between backward and forward approximations, and as is customary in numerical analysis, our discussion revolves around backward approximation guarantees. For the convenience of the reader, we write down below the explicit guarantees that one gets by using (4) and invoking \(\mathsf {EIG}\) with accuracy \(\frac{\delta }{6n \kappa _{\mathrm {eig}}}\).

Corollary 1.7

(Forward Approximation Algorithm) There is a randomized algorithm which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\), a desired accuracy parameter \(\delta >0\), and an estimate \(K\ge \kappa _{\mathrm {eig}}(A)\) outputs a \(\delta \)-forward approximate solution to the eigenproblem for A in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n K}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O(\log ^4(nK/\delta )\log n) \end{aligned}$$

bits of precision, with probability at least \(1-1/n-12/n^2\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Remark 1.8

(Accuracy vs. Precision) The gold standard of “backward stability” in numerical analysis postulates that

$$\begin{aligned} \log (1/{\textbf {u }}) = \log (1/\delta ) + \log (n), \end{aligned}$$

i.e., the number of bits of precision is linear in the number of bits of accuracy. The relaxed notion of “logarithmic stability” introduced in [23] requires

$$\begin{aligned} \log (1/{\textbf {u }}) = \log (1/\delta )+O(\log ^c(n)\log (\kappa )) \end{aligned}$$

for some constant c, where \(\kappa \) is an appropriate condition number. In comparison, Theorem 1.6 obtains the weaker relationship

$$\begin{aligned} \log (1/{\textbf {u }}) = O(\log ^4(1/\delta )\log (n) + \log ^5(n)), \end{aligned}$$

which is still polylogarithmic in n in the regime \(\delta =1/\mathrm {poly}(n)\).

1.3 Related Work

Minimum Eigenvalue Gap. The minimum eigenvalue gap of random matrices has been studied in the case of Hermitian and unitary matrices, beginning with the work of Vinson [64], who proved an \(\Omega (n^{-4/3})\) lower bound on this gap in the case of the Gaussian Unitary Ensemble (GUE) and the Circular Unitary Ensemble (CUE). Bourgade and Ben Arous [3] derived exact limiting formulas for the distributions of all the gaps for the same ensembles. Nguyen, Tao, and Vu [50] obtained non-asymptotic inverse polynomial bounds for a large class of non-integrable Hermitian models with i.i.d. entries (including Bernoulli matrices).

In a different direction, Aizenman et al. proved an inverse-polynomial bound [1] in the case of an arbitrary Hermitian matrix plus a GUE matrix or a Gaussian orthogonal ensemble (GOE) matrix, which may be viewed as a smoothed analysis of the minimum gap. Theorem 3.6 may be viewed as a non-Hermitian analogue of the last result.

In the non-Hermitian case, Ge [33] obtained an inverse polynomial bound for i.i.d. matrices with real entries satisfying some mild moment conditions, and [55] (Footnote 5) proved an inverse polynomial lower bound for the complex Ginibre ensemble. Theorem 3.6 may be seen as a generalization of these results to non-centered complex Gaussian matrices.

Smoothed Analysis and Free Probability. The study of numerical algorithms on Gaussian random matrices (i.e., the case \(A=0\) of smoothed analysis) dates back to [25, 29, 56, 65]. The powerful idea of improving the conditioning of a numerical computation by adding a small amount of Gaussian noise was introduced by Spielman and Teng in [59], in the context of the simplex algorithm. Sankar, Spielman, and Teng [54] showed that adding real Gaussian noise to any matrix yields a matrix with polynomially bounded condition number; [9] can be seen as an extension of this result to the condition number of the eigenvector matrix, where the proof crucially requires that the Gaussian perturbation is complex rather than real. The main difference between our results and most of the results on smoothed analysis (including [2]) is that our running time depends logarithmically rather than polynomially on the size of the perturbation.

The broad idea of regularizing the spectral instability of a nonnormal matrix by adding a random matrix can be traced back to the work of Śniady [58] and Haagerup and Larsen [37] in the context of Free Probability theory.

Matrix Sign Function. The matrix sign function was introduced by Zolotarev in 1877. It became a popular topic in numerical analysis following the work of Beavers and Denman [10, 11, 27] and Roberts [53], who used it first to solve the algebraic Riccati and Lyapunov equations and then as an approach to the eigenproblem; see [44] for a broad survey of its early history. The numerical stability of Roberts’ Newton iteration was investigated by Byers [14], who identified some cases where it is and is not stable. Malyshev [46], Byers et al. [15], Bai et al. [5], and Bai and Demmel [4] studied the condition number of the matrix sign function, and showed that if the Newton iteration converges then it can be used to obtain a high-quality invariant subspace (Footnote 6), but did not prove convergence in finite arithmetic and left this as an open question (Footnote 7). The key issue in analyzing the convergence of the iteration is to bound the condition numbers of the intermediate matrices that appear, as N. Higham remarks in his 2008 textbook:

Of course, to obtain a complete picture, we also need to understand the effect of rounding errors on the iteration prior to convergence. This effect is surprisingly difficult to analyze. \(\ldots \) Since errors will in general occur on each iteration, the overall error will be a complicated function of \(\kappa _{sign}(X_k)\) and \(E_k\) for all k. \(\ldots \) We are not aware of any published rounding error analysis for the computation of sign(A) via the Newton iteration.—[40, Section 5.7]

This is precisely the problem solved by Theorem 1.5, which is as far as we know the first provable algorithm for computing the sign function of an arbitrary matrix which does not require computing the Jordan form.

In the special case of Hermitian matrices, Higham [38] established efficient reductions between the sign function and the polar decomposition. Byers and Xu [16] proved backward stability of a certain scaled version of the Newton iteration for Hermitian matrices, in the context of computing the polar decomposition. Higham and Nakatsukasa [49] (see also the improvement [48]) proved backward stability of a different iterative scheme for computing the polar decomposition, and used it to give backward stable spectral bisection algorithms for the Hermitian eigenproblem with \(O(n^3)\)-type complexity.

Non-Hermitian Eigenproblem. Floating Point Arithmetic. The eigenproblem has been thoroughly studied in the numerical analysis community, in the floating point model of computation. While there are provably fast and accurate algorithms in the Hermitian case (see the next subsection) and a large body of work for various structured matrices (see, e.g., [13]), the general case is not nearly as well-understood. As recently as 1997, J. Demmel remarked in his well-known textbook [26]: “\(\ldots \) the problem of devising an algorithm [for the non-Hermitian eigenproblem] that is numerically stable and globally (and quickly!) convergent remains open.”

Demmel’s question remained entirely open until 2015, when it was answered in the following sense by Armentano, Beltrán, Bürgisser, Cucker, and Shub in the remarkable paper [2]. They exhibited an algorithm (see their Theorem 2.28) which given any \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and \(\sigma >0\) produces in \(O(n^{9}/\sigma ^2)\) expected arithmetic operations the diagonalization of the nearby random perturbation \(A+\sigma G\) where G is a matrix with standard complex Gaussian entries. By setting \(\sigma \) sufficiently small, this may be viewed as a backward approximation algorithm for diagonalization, in that it solves a nearby problem essentially exactly (Footnote 8)—in particular, by setting \(\sigma =\delta /\sqrt{n}\) and noting that \(\Vert G\Vert =O(\sqrt{n})\) with very high probability, their result implies a running time of \(O(n^{10}/\delta ^2)\) in our setting. Their algorithm is based on homotopy continuation methods, which they argue informally are numerically stable and can be implemented in finite precision arithmetic. Our algorithm is similar on a high level in that it adds a Gaussian perturbation to the input and then obtains a high accuracy forward approximate solution to the perturbed problem. The difference is that their overall running time depends polynomially rather than logarithmically on the accuracy \(\delta \) desired with respect to the original unperturbed problem (Table 1).

Table 1 Results for finite-precision floating-point arithmetic

Other Models of Computation. If we relax the requirements further and ask for any provable algorithm in any model of Boolean computation, there is only one more positive result with a polynomial bound on the number of bit operations: Jin-Yi Cai showed in 1994 [17] that given an \(n\times n\) rational matrix A with entries of bit length a, one can find a \(\delta \)-forward approximation to its Jordan Normal Form \(A=VJV^{-1}\) in time \(\mathrm {poly}(n,a,\log (1/\delta ))\), where the degree of the polynomial is at least 12. This algorithm works in the rational arithmetic model of computation, so it does not quite answer Demmel’s question since it is not a numerically stable algorithm. However, it enjoys the significant advantage of being able to compute forward approximations to discontinuous quantities such as the Jordan structure (Table 2).

Table 2 Results for other models of arithmetic

As far as we are aware, there are no other published provably polynomial-time algorithms for the general eigenproblem. The two standard references for diagonalization appearing most often in theoretical computer science papers do not meet this criterion. In particular, the widely cited work by Pan and Chen [51] proves that one can compute the eigenvalues of A in \(O(n^\omega + n\log \log (1/\delta ))\) (suppressing logarithmic factors) arithmetic operations by finding the roots of its characteristic polynomial, which becomes a bound of \(O(n^{\omega +1}a+n^2\log (1/\delta )\log \log (1/\delta ))\) bit operations if the characteristic polynomial is computed exactly in rational arithmetic and the matrix has entries of bit length a. However, that paper does not give any bound for the amount of time taken to find approximate eigenvectors from approximate eigenvalues, and states this as an open problem (Footnote 9).

Finally, the important work of Demmel et al. [22] (see also the followup [6]), which we rely on heavily, does not claim to provably solve the eigenproblem either—it bounds the running time of one iteration of a specific algorithm, and shows that such an iteration can be implemented numerically stably, without proving any bound on the number of iterations required in general.

Hermitian Eigenproblem. For comparison, the eigenproblem for Hermitian matrices is much better understood. We cannot give a complete bibliography of this huge area, but mention one relevant landmark result: the work of Wilkinson [66], who exhibited a globally convergent diagonalization algorithm, and the work of Dekker and Traub [21], who quantified the rate of convergence of Wilkinson’s algorithm and from which it follows that the Hermitian eigenproblem can be solved with backward error \(\delta \) in \(O(n^3+n^2\log (1/\delta ))\) arithmetic operations in exact arithmetic (Footnote 10). We refer the reader to [52, §8.10] for the simplest and most insightful proof of this result, due to Hoffman and Parlett [41].

There has also recently been renewed interest in this problem in the theoretical computer science community, with the goal of bringing the runtime close to \(O(n^\omega )\): Louis and Vempala [45] show how to find a \(\delta \)-approximation of just the largest eigenvalue in \(O(n^\omega \log ^4(n)\log ^2(1/\delta ))\) bit operations, and Ben-Or and Eldar [12] give an \(O(n^{\omega +1}\mathrm {polylog}(n))\)-bit-operation algorithm for finding a \(1/\mathrm {poly}(n)\)-approximate diagonalization of an \(n\times n\) Hermitian matrix normalized to have \(\Vert A\Vert \le 1\).

Remark 1.9

(Davies’ Conjecture) The beautiful paper [20] introduced the idea of approximating a matrix function f(A) for nonnormal A by \(f(A+E)\) for some well-chosen E regularizing the eigenvectors of A. This directly inspired our approach to solving the eigenproblem via regularization.

The existence of an approximate diagonalization (1) for every A with a well-conditioned similarity V (i.e., \(\kappa (V)\) depending polynomially on \(1/\delta \) and n) was precisely the content of Davies’ conjecture [20], which was recently solved by some of the authors and Mukherjee in [9]. The existence of such a V is a prerequisite for proving that one can always efficiently find an approximate diagonalization in finite arithmetic, since if \(\Vert V\Vert \Vert V^{-1}\Vert \) is very large it may require many bits of precision to represent. Thus, Theorem 1.6 can be viewed as an efficient algorithmic answer to Davies’ question.

Remark 1.10

(Subsequent work in Random Matrix Theory) Since the first version of the present paper was made public there have been some advances in random matrix theory [8, 43] that prove analogues of Theorem 1.4 in the case where \(G_n\) is replaced by a perturbation with random real independent entries. These results formally articulate that, in the context of this paper, there is nothing special about complex Ginibre matrices, and that the same regularization effect can be achieved using a broader class of perturbations. Bounding the eigenvector condition number and the eigenvalue gap when the random perturbation has real entries poses interesting technical challenges that were tackled in different ways in the aforementioned papers. We also refer the reader to [18] where optimal results were obtained in the case where \(A=0\) and \(G_n\) has real Gaussian entries.

Remark 1.11

(Alternate Proofs using [2]) In October 2021 (about two years after the first appearance of this paper), we noticed that a version of Theorem 1.4 (with a worse \(\kappa _V\) bound but a better eigenvalue gap bound) as well as the main theorem of [9] (with a slightly worse dependence on n) can be easily derived from some auxiliary results shown in [2] (specifically Proposition 2.7 and Theorem 2.14 of that paper), which we were not previously aware of. We present these short alternate proofs in “Appendix D”. We remark that our original proofs are essentially different from those appearing in [2]—in particular, they rely on studying the area of pseudospectra, whereas the proof of Theorem 2.14 of [2] relies on geometric concepts and the coarea formula for Gaussian integrals of certain determinantal quantities on Riemannian manifolds. The proofs based on pseudospectra are arguably more flexible; as mentioned in Remark 1.10, they have been recently generalized to ensembles besides the complex Ginibre ensemble, which seems difficult to do for the more algebraic proofs of [2].

Reader Guide. This paper contains a lot of parameters and constants. On first reading, it may be good to largely ignore the constants not appearing in exponents and to keep in mind the typical setting \(\delta =1/\mathrm {poly}(n)\) for the accuracy, in which case the important auxiliary parameters \(\omega , 1-\alpha , \epsilon , \beta , \eta \) are all \(1/\mathrm {poly}(n)\), and the machine precision is \(\log (1/{\textbf {u }})=\mathrm {polylog}(n)\).

2 Preliminaries

Let \(M \in \mathbb {C}^{n\times n}\) be a complex matrix, not necessarily normal. We will write matrices and vectors with uppercase and lowercase letters, respectively. Let us denote by \(\Lambda (M)\) the spectrum of M and by \(\lambda _i(M)\) its individual eigenvalues. In the same way we denote the singular values of M by \(\sigma _i(M)\) and we adopt the convention \(\sigma _1(M) \ge \sigma _2(M) \ge \cdots \ge \sigma _n(M)\). When M is clear from the context we will simplify notation and just write \(\Lambda , \lambda _i\) or \(\sigma _i\), respectively.

Recall that the operator norm of M is

$$\begin{aligned} \Vert M\Vert = \sigma _1(M) = \sup _{\Vert x \Vert = 1}\Vert Mx\Vert . \end{aligned}$$

As usual, we will say that M is diagonalizable if it can be written as \(M = VDV^{-1}\) for some diagonal matrix D whose diagonal entries are the eigenvalues of M. In this case, we have the spectral expansion

$$\begin{aligned} M = \sum _{i=1}^n \lambda _i v_i w_i^*, \end{aligned}$$
(10)

where the right and left eigenvectors \(v_i\) and \(w_i^*\) are the columns and rows of V and \(V^{-1}\), respectively, normalized so that \(w^*_i v_i = 1\).
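This normalization is what one gets for free from a computed diagonalization: if the columns of V are the \(v_i\), then the rows of \(V^{-1}\) are exactly the \(w_i^*\). A quick numerical check of (10) (ours, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

evals, V = np.linalg.eig(M)  # columns of V: right eigenvectors v_i
W = np.linalg.inv(V)         # rows of W: left eigenvectors w_i^*

# w_i^* v_j = delta_ij in this pairing, and (10) reconstructs M.
expansion = sum(evals[i] * np.outer(V[:, i], W[i, :]) for i in range(n))
print(np.linalg.norm(M - expansion, 2))      # ~ machine precision
print(np.linalg.norm(W @ V - np.eye(n), 2))  # ~ machine precision
```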

2.1 Spectral Projectors and Holomorphic Functional Calculus

Let \(M \in \mathbb {C}^{n\times n}\), with eigenvalues \(\lambda _1,\ldots ,\lambda _n\). We say that a matrix P is a spectral projector for M if \(MP = PM\) and \(P^2 = P\). For instance, each of the terms \(v_i w_i^*\) appearing in the spectral expansion (10) is a spectral projector, as \(Mv_iw_i^*= \lambda _i v_i w_i^*= v_i w_i^*M\) and \((v_i w_i^*)^2 = v_i w_i^*\) since \(w_i^*v_i = 1\). If \(\Gamma _i\) is a simple closed positively oriented rectifiable curve in the complex plane separating \(\lambda _i\) from the rest of the spectrum, then it is well known that

$$\begin{aligned} v_i w_i^*= \frac{1}{2\pi i}\oint _{\Gamma _i} (z - M)^{-1}d z, \end{aligned}$$

by taking the Jordan normal form of the resolvent \((z - M)^{-1}\) and applying Cauchy’s integral formula (Footnote 11).

Since every spectral projector P commutes with M, its range agrees exactly with an invariant subspace of M. We will often find it useful to choose some region of the complex plane bounded by a simple closed positively oriented rectifiable curve \(\Gamma \), and compute the spectral projector onto the invariant subspace spanned by those eigenvectors whose eigenvalues lie inside \(\Gamma \). Such a projector can be computed by a contour integral analogous to the above.

Recall that if f is any function, and M is diagonalizable, then we can meaningfully define \(f(M) := V f(D) V^{-1}\), where f(D) is simply the result of applying f to each element of the diagonal matrix D. The holomorphic functional calculus gives an equivalent definition that extends to the case when M is non-diagonalizable. As we will see, it has the added benefit that bounds on the norm of the resolvent of M can be converted into bounds on the norm of f(M).

Proposition 2.1

(Holomorphic Functional Calculus) Let M be any matrix, \(B \supset \Lambda (M)\) be an open neighborhood of its spectrum (not necessarily connected), and \(\Gamma _1,...,\Gamma _k\) be simple closed positively oriented rectifiable curves in B whose interiors together contain all of \(\Lambda (M)\). Then if f is holomorphic on B, the definition

$$\begin{aligned} f(M) := \frac{1}{2\pi i}\sum _{j=1}^k \oint _{\Gamma _j} f(z)(z - M)^{-1}d z \end{aligned}$$

is an algebra homomorphism in the sense that \((fg)(M) = f(M)g(M)\) for any f and g holomorphic on B.

Finally, we will frequently use the resolvent identity

$$\begin{aligned} (z - M)^{-1} - (z - M')^{-1} = (z-M)^{-1}(M - M')(z - M')^{-1} \end{aligned}$$

to analyze perturbations of contour integrals.
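These contour-integral formulas are also effective numerically. The sketch below (ours, assuming numpy; contour and test matrix chosen arbitrarily) approximates a spectral projector by applying the trapezoid rule, which converges exponentially for periodic analytic integrands, to a circle enclosing one eigenvalue.

```python
import numpy as np

def contour_projector(M, center, radius, m=200):
    """Approximate (1/(2 pi i)) * integral of (z - M)^{-1} dz over the
    circle |z - center| = radius via the m-point trapezoid rule. The
    result is the spectral projector onto the eigenvalues strictly
    inside, provided the circle stays well away from the spectrum."""
    n = M.shape[0]
    I = np.eye(n)
    P = np.zeros((n, n), dtype=complex)
    for k in range(m):
        z = center + radius * np.exp(2j * np.pi * k / m)
        dz = 2j * np.pi * (z - center) / m  # z'(theta) * (2 pi / m)
        P += np.linalg.solve(z * I - M, I) * dz
    return P / (2j * np.pi)

M = np.diag([0.0, 3.0, 4.0]) + 0.1 * np.random.default_rng(5).standard_normal((3, 3))
P = contour_projector(M, center=0.0, radius=1.0)
print(np.linalg.norm(P @ P - P, 2))  # idempotent, up to quadrature error
print(np.trace(P).real)              # ~ 1.0: one eigenvalue enclosed
```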

2.2 Pseudospectrum and Spectral Stability

The \(\epsilon \)-pseudospectrum of a matrix is defined in (5). Directly from this definition, we can relate the pseudospectra of a matrix and a perturbation of it.

Proposition 2.2

([62], Theorem 52.4) For any \(n \times n\) matrices M and E and any \(\epsilon > 0\), \(\Lambda _{\epsilon - \Vert E\Vert }(M) \subseteq \Lambda _\epsilon (M+E)\).

It is also immediate that \(\Lambda (M) \subset \Lambda _\epsilon (M)\), and in fact a stronger relationship holds as well:

Proposition 2.3

([62], Theorem 4.3) For any \(n \times n\) matrix M, any bounded connected component of \(\Lambda _\epsilon (M)\) must contain an eigenvalue of M.

Several other notions of stability will be useful to us as well. If M has distinct eigenvalues \(\lambda _1,\ldots ,\lambda _n\), and spectral expansion as in (10), we define the eigenvalue condition number of \(\lambda _i\) to be

$$\begin{aligned} \kappa (\lambda _i) := \left\| {v_i w_i^*}\right\| =\Vert v_i\Vert \Vert w_i\Vert . \end{aligned}$$

By considering the scaling of V in (2) in which its columns \(v_i\) have unit length, so that \(\kappa (\lambda _i) = \Vert w_i \Vert \), we obtain the useful relationship

$$\begin{aligned} \kappa _V(M)\le \Vert V \Vert \Vert V^{-1} \Vert \le \Vert V\Vert _F\Vert V^{-1}\Vert _F\le \sqrt{n\cdot \sum _{i\le n} \kappa (\lambda _i)^2}. \end{aligned}$$
(11)

Note also that the eigenvector condition number and pseudospectrum are related as follows:

Lemma 2.4

([62]) Let \(D(z,r)\) denote the open disk of radius r centered at \(z \in \mathbb {C}\). For every \(M \in \mathbb {C}^{n\times n}\),

$$\begin{aligned} \bigcup _i D(\lambda _i,\epsilon )\subset \Lambda _\epsilon (M)\subset \bigcup _i D(\lambda _i, \epsilon \kappa _V(M)). \end{aligned}$$
(12)

In this paper we will repeatedly use that assumptions about the pseudospectrum of a matrix can be turned into stability statements about functions applied to the matrix via the holomorphic functional calculus. Here we describe an instance of particular importance.

Let \(\lambda _i\) be a simple eigenvalue of M and let \(\Gamma \) be a contour in the complex plane, as in Sect. 2.1, separating \(\lambda _i\) from the rest of the spectrum of M, and assume \(\Lambda _\epsilon (M)\cap \Gamma =\emptyset \). Then, for any \(\Vert M-M'\Vert< \eta <\epsilon \), a combination of Proposition 2.2 and Proposition 2.3 implies that there is a unique eigenvalue \(\lambda _i'\) of \(M'\) in the region enclosed by \(\Gamma \), and furthermore \(\Lambda _{\epsilon -\eta }(M')\cap \Gamma = \emptyset \). If \(v_i'\) and \(w_i'\) are the right and left eigenvectors of \(M'\) corresponding to \(\lambda _i'\), we have

$$\begin{aligned} \Vert v_i'w_i'^*- v_iw_i^*\Vert&= \frac{1}{2\pi } \left\| \oint _\Gamma (z - M)^{-1} - (z-M')^{-1} d z \right\| \nonumber \\&= \frac{1}{2\pi }\left\| \oint _\Gamma (z - M)^{-1}(M - M')(z-M')^{-1} d z \right\| \nonumber \\&\le \frac{\ell (\Gamma )}{2\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$
(13)

We have introduced enough tools to prove Proposition 1.1.

Proof of Proposition 1.1

For \(t\in [0, 1]\) define \(A(t) = (1-t)A+ tA' \). Since \(\delta <\frac{\mathrm {gap}(A)}{8\kappa _V(A)}\) the Bauer–Fike theorem implies that A(t) has distinct eigenvalues for all t, and in fact \(\mathrm {gap}(A(t))\ge \frac{3\mathrm {gap}(A)}{4}\). Standard results in perturbation theory (for instance [34,  Theorem 1] or any of the references therein) imply that for every \(i=1, \dots , n\), A(t) has a unique eigenvalue \(\lambda _i(t)\) such that \(\lambda _i(t)\) is a differentiable trajectory, \(\lambda _i(0) =\lambda _i\) and \(\lambda _i(1)=\lambda _i'\). Let \(P_i(t)\) be the associated spectral projector of \(\lambda _i(t)\), which is uniquely defined via a contour integral, and write \(P_i = P_i(0)\).

Let \(\Gamma _i\) be the positively oriented contour forming the boundary of the closed disk centered at \(\lambda _i\) with radius \(\mathrm {gap}(A)/2\), and define \(\epsilon =\frac{\mathrm {gap}(A)}{2\kappa _V(A)}\). Lemma 2.4 implies \(\Lambda _{\epsilon }(A)\) is contained in the union of these disks over all \(i \in [n]\), and for fixed \(t\in [0, 1]\), since \(\Vert A-A(t)\Vert < t \delta \le \epsilon /4\), Proposition 2.2 gives the same containment for \(\Lambda _{3\epsilon /4}(A(t))\). Since these disks intersect only in their boundaries (if they do at all), \(\Vert (z - A)^{-1}\Vert \le 1/\epsilon \) and \(\Vert (z - A(t))^{-1}\Vert \le 4/(3\epsilon )\) for \(z \in \Gamma _i\). By the derivation of (13) above,

$$\begin{aligned} |\kappa (\lambda _i)-\kappa (\lambda _i(t))| \le \Vert P_i(t) - P_i\Vert \le \frac{\ell (\Gamma _i)}{2\pi } \cdot \frac{1}{\epsilon }\cdot \frac{4}{3\epsilon }\cdot \frac{\epsilon }{4} = \frac{\mathrm {gap}(A)}{2}\cdot \frac{2\kappa _V(A)}{3\,\mathrm {gap}(A)} = \frac{\kappa _V(A)}{3} \end{aligned}$$

and hence \(\kappa (\lambda _i(t))\le \kappa (\lambda _i)+\kappa _V(A)/3 \le 4\kappa _V(A)/3\). Combining this with (11) we obtain

$$\begin{aligned} \kappa _V(A(t)) \le 2 \sqrt{n \cdot \sum _i \kappa (\lambda _i)^2}< 4n \kappa _V(A)/3. \end{aligned}$$

From Theorem 2 of [34] and the subsequent discussion on p. 468, there exist analytic functions \(v_i(t)\) satisfying \(v_i(0) = v_i\) and \(A(t)v_i(t) = \lambda _i(t)v_i(t)\) for all \(i \in [n]\) and \(t \in [0,1]\), which furthermore admit the bound

$$\begin{aligned} \Vert \dot{v_i}(t)\Vert \le \frac{\kappa _V(A(t))}{\mathrm {gap}(A(t))}\Vert \dot{A}(t)\Vert \Vert v_i(t)\Vert \le \frac{\delta \kappa _V(A(t))}{\mathrm {gap}(A(t))}\Vert v_i(t)\Vert . \end{aligned}$$

However, these \(v_i(t)\) need not in general be unit vectors (see [34,  Section 3.4] and references for discussion of various normalizations). Therefore set \({\hat{v}}_i(t) = \Vert v_i(t)\Vert ^{-1} v_i(t)\), and note that by an application of the chain rule,

$$\begin{aligned} \Vert \dot{\hat{v_i}}(t)\Vert \le \frac{\delta \kappa _V(A(t))}{\mathrm {gap}(A(t))}. \end{aligned}$$

It then follows that the vectors \(v_i' = \hat{v_i}(1)\) for \(i \in [n]\) satisfy the conclusion of the theorem, by bounding \(\kappa _V(A(t))\le 4n\kappa _V(A)/3\) and \(\mathrm {gap}(A(t))\ge \frac{3\mathrm {gap}(A)}{4}\), and integrating the resulting upper bound \(\Vert \dot{\hat{v_i}}(t)\Vert \le \frac{16n\delta \kappa _V(A)}{9\mathrm {gap}(A)}\) from \(t = 0\) to \(t= 1\). \(\square \)

2.3 Finite-Precision Arithmetic

We briefly elaborate on the axioms for floating-point arithmetic given in Sect. 1.1. Similar guarantees to the ones appearing in that section for scalar-scalar operations also hold for operations such as matrix–matrix addition and matrix-scalar multiplication. In particular, if A is an \(n\times n\) complex matrix, then (writing \(\circ \) for the entrywise product)

$$\begin{aligned} \mathsf {fl}(A) = A + A \circ \Delta \qquad |\Delta _{i,j}| < {\textbf {u }}. \end{aligned}$$

It will be convenient for us to write such errors in additive, as opposed to multiplicative, form. We can convert the above to additive error as follows. Recall that for any \(n\times n\) matrix, the spectral norm (the \(\ell ^2 \rightarrow \ell ^2\) operator norm) is at most \(\sqrt{n}\) times the \(\ell ^1 \rightarrow \ell ^2\) operator norm, i.e. the maximal \(\ell ^2\) norm of a column. Thus, we have

$$\begin{aligned} \Vert A \circ \Delta \Vert \le \sqrt{n} \max _i \Vert (A\circ \Delta ) e_i\Vert \le \sqrt{n}\max _{i,j}|\Delta _{i,j}| \max _i \Vert A e_i\Vert \le {\textbf {u }}\sqrt{n}\Vert A\Vert . \end{aligned}$$
(14)

For more complicated operations such as matrix–matrix multiplication and matrix inversion, we use existing error guarantees from the literature. This is the subject of Sect. 2.5.

We will also need to compute the trace of a matrix \(A \in \mathbb {C}^{n\times n}\), and normalize a vector \(x \in \mathbb {C}^n\). Error analysis of these is standard (see for instance the discussion in [39,  Chapters 3–4]), and the results in this paper are highly insensitive to the details. For simplicity, calling \({\hat{x}} := x/\Vert x\Vert \), we will assume that

$$\begin{aligned} |\mathsf {fl}\left( \mathrm {Tr}A\right) - \mathrm {Tr}A|&\le n\Vert A\Vert {\textbf {u }} \end{aligned}$$
(15)
$$\begin{aligned} \Vert \mathsf {fl}({\hat{x}}) - {\hat{x}}\Vert&\le n{\textbf {u }}. \end{aligned}$$
(16)

Each of these can be achieved by assuming that \({\textbf {u }}n \le \epsilon \) for some suitably chosen \(\epsilon \), independent of n, a requirement which will shortly be superseded by several tighter assumptions on the machine precision.

Throughout the paper, we will take the pedagogical perspective that our algorithms are games played between the practitioner and an adversary who may additively corrupt each operation. In particular, we will include explicit error terms (always denoted by \(E_{(\cdot )}\)) in each appropriate step of every algorithm. In many cases we will first analyze a routine in exact arithmetic—in which case the error terms will all be set to zero—and subsequently determine the machine precision \({\textbf {u }}\) necessary so that the errors are small enough to guarantee convergence.

2.4 Sampling Gaussians in Finite Precision

For various parts of the algorithm, we will need to sample from normal distributions. For our model of arithmetic, we assume that the complex normal distribution can be sampled up to machine precision in O(1) arithmetic operations. To be precise, we assume the existence of the following sampler:

Definition 2.5

(Complex Gaussian Sampling)

A \(c_{\mathsf {N}}\)-stable Gaussian sampler \(\mathsf {N}(\sigma )\) takes as input \(\sigma \in \mathbb {R}_{\ge 0}\) and outputs a sample of a random variable \({\widetilde{G}} = \mathsf {N}(\sigma )\) with the property that there exists \(G \sim N_{\mathbb {C}}(0, \sigma ^2)\) satisfying

$$\begin{aligned} |{\widetilde{G}} - G| \le c_{\mathsf {N}}\sigma \cdot {\textbf {u }}\end{aligned}$$

with probability one, in at most \(T_\mathsf {N}\) arithmetic operations for some universal constant \(T_\mathsf {N}>0\).

Note that, since the Gaussian distribution has unbounded support, one should only expect the sampler \(\mathsf {N}(\sigma )\) to have a relative error guarantee of the sort \(|{\widetilde{G}} - G| \le c_{\mathsf {N}}\sigma |G| \cdot {\textbf {u }}\). However, as will become clear below, we only care about realizations of Gaussians satisfying \(|G|<R\), for a certain prespecified \(R>0\), and the rare event \(|G|>R\) will be accounted for in the failure probability of the algorithm. So, for the sake of exposition we decided to omit the |G| in the bound on \(|{\widetilde{G}}-G|\).

We will only sample \(O(n^2)\) Gaussians during the algorithm, so this sampling will not contribute significantly to the runtime. Here as everywhere in the paper, we will omit issues of underflow or overflow. Throughout this paper, to simplify some of our bounds, we will also assume that \(c_{\mathsf {N}}\ge 1\).
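In practice such a sampler is immediate: a complex Gaussian is a pair of independent real Gaussians, as in the sketch below (ours, assuming numpy; in the terms of Definition 2.5, the roundoff of the two real samples and of the scaling is what the \(c_{\mathsf {N}}\sigma {\textbf {u }}\) term absorbs).

```python
import numpy as np

rng = np.random.default_rng(6)

def complex_gaussian(sigma, size=()):
    """Sample N_C(0, sigma^2): independent real and imaginary parts,
    each a real Gaussian with variance sigma^2 / 2."""
    re = rng.standard_normal(size)
    im = rng.standard_normal(size)
    return sigma / np.sqrt(2) * (re + 1j * im)

n = 1000
G = complex_gaussian(1 / np.sqrt(n), size=(n, n))  # E|G_ij|^2 = 1/n
print(n * np.mean(np.abs(G) ** 2))  # ~ 1
print(np.linalg.norm(G, 2))         # ~ 2, the edge of the circular law
```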

2.5 Black-box Error Assumptions for Multiplication, Inversion, and QR

Our algorithm uses matrix–matrix multiplication, matrix inversion, and QR factorization as primitives. For our analysis, we must therefore assume some bounds on the error and runtime costs incurred by these subroutines. In this section, we first formally state the kind of error and runtime bounds we require, and then discuss some implementations known in the literature that satisfy each of our requirements with modest constants.

Our definitions are inspired by the definition of logarithmic stability introduced in [22]. Roughly speaking, they say that implementing the algorithm with floating point precision \({\textbf {u }}\) yields an accuracy which is at most polynomially or quasipolynomially in n worse than \({\textbf {u }}\) (possibly also depending on the condition number in the case of inversion). Their definition has the property that while a logarithmically stable algorithm is not strictly speaking backward stable, it can attain the same forward error bound as a backward stable algorithm at the cost of increasing the bit length by a polylogarithmic factor. See Section 3 of their paper for a precise definition and a more detailed discussion of how their definition relates to standard numerical stability notions.

Definition 2.6

A \(\mu _{\mathsf {MM}}(n)\)-stable multiplication algorithm \(\mathsf {MM}(\cdot , \cdot )\) takes as input \(A,B\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}>0\) and outputs \(C=\mathsf {MM}(A, B)\) satisfying

$$\begin{aligned} \Vert C-AB\Vert \le \mu _{\mathsf {MM}}(n) \cdot {\textbf {u }}\Vert A\Vert \Vert B\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\), in \(T_\mathsf {MM}(n)\) arithmetic operations.

Definition 2.7

A \((\mu _{\mathsf {INV}}(n), c_\mathsf {INV})\)-stable inversion algorithm \(\mathsf {INV}(\cdot )\) takes as input \(A\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}\) and outputs \(C=\mathsf {INV}(A)\) satisfying

$$\begin{aligned} \Vert C-A^{-1}\Vert \le \mu _{\mathsf {INV}}(n)\cdot {\textbf {u }}\cdot \kappa (A)^{c_\mathsf {INV}\log n}\Vert A^{-1}\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\), in \(T_\mathsf {INV}(n)\) arithmetic operations.

Definition 2.8

A \(\mu _\mathsf {QR}(n)\)-stable QR factorization algorithm \(\mathsf {QR}(\cdot )\) takes as input \(A\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}\), and outputs \([Q,R]=\mathsf {QR}(A)\) such that

  1.

    R is exactly upper triangular.

  2.

    There is a unitary \(Q'\) and a matrix \(A'\) such that

    $$\begin{aligned} A' = Q'R, \end{aligned}$$
    (17)

    and

    $$\begin{aligned} \Vert Q' - Q\Vert \le \mu _\mathsf {QR}(n){\textbf {u }}, \quad \text {and} \quad \Vert A'-A\Vert \le \mu _\mathsf {QR}(n){\textbf {u }}\Vert A\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\). Its running time is \(T_\mathsf {QR}(n)\) arithmetic operations.

Remark 2.9

Throughout this paper, to simplify some of our bounds, we will assume that

$$\begin{aligned} 1 \le \mu _\mathsf {MM}(n), \mu _{\mathsf {INV}}(n) , \mu _\mathsf {QR}(n), c_{\mathsf {INV}}\log n. \end{aligned}$$

The above definitions can be instantiated with traditional \(O(n^3)\)-complexity algorithms for which \(\mu _{\mathsf {MM}}, \mu _\mathsf {QR}, \mu _{\mathsf {INV}}\) are all O(n) and \(c_\mathsf {INV}=1\) [39]. This yields easily implementable practical algorithms with running times depending cubically on n.

In order to achieve \(O(n^\omega )\)-type efficiency, we instantiate them with fast-matrix-multiplication-based algorithms and with \(\mu (n)\) taken to be a low-degree polynomial [22]. Specifically, the following parameters are known to be achievable.

Theorem 2.10

(Fast and Stable Instantiations of \(\mathsf {MM},\mathsf {INV}, \mathsf {QR}\))

  1.

    If \(\omega \) is the exponent of matrix multiplication, then for every \(\eta >0\) there is a \(\mu _{\mathsf {MM}}(n)\)-stable multiplication algorithm with \(\mu _{\mathsf {MM}}(n)=n^{c_\eta }\) and \(T_\mathsf {MM}(n)=O(n^{\omega +\eta })\), where \(c_\eta \) does not depend on n.

  2.

    Given an algorithm for matrix multiplication satisfying (1), there is a \((\mu _{\mathsf {INV}}(n),c_\mathsf {INV})\)-stable inversion algorithm with

    $$\begin{aligned} \mu _{\mathsf {INV}}(n)\le O(\mu _{\mathsf {MM}}(n)n^{\lg (10)}),\quad \quad c_\mathsf {INV}\le 8, \end{aligned}$$

    and \(T_\mathsf {INV}(n)\le T_\mathsf {MM}(3n)=O(T_\mathsf {MM}(n))\).

  3.

    Given an algorithm for matrix multiplication satisfying (1), there is a \(\mu _\mathsf {QR}(n)\)-stable QR factorization algorithm with

    $$\begin{aligned} \mu _\mathsf {QR}(n)=O(n^{c_\mathsf {QR}} \mu _{\mathsf {MM}}(n)), \end{aligned}$$

    where \(c_\mathsf {QR}\) is an absolute constant, and \(T_\mathsf {QR}(n)=O(T_\mathsf {MM}(n))\).

In particular, all of the running times above are bounded by \(T_\mathsf {MM}(n)\) for an \(n\times n\) matrix.

Proof

(1) is Theorem 3.3 of [23]. (2) is Theorem 3.3 (see also equation (9) above its statement) of [22]. (3) appears in Section 4.1 of [22]. The final claim follows by noting that \(T_\mathsf {MM}(3n)=O(T_\mathsf {MM}(n))\), which one sees by dividing a \(3n\times 3n\) matrix into nine \(n\times n\) blocks and proceeding blockwise, at the cost of a factor of 9 in \(\mu _{\mathsf {INV}}(n)\). \(\square \)

We remark that for specific existing fast matrix multiplication algorithms such as Strassen’s algorithm, specific small values of \(\mu _\mathsf {MM}(n)\) are known (see [23] and its references for details), so these may also be used as a black box, though we will not do this in this paper.

3 Pseudospectral Shattering

This section is devoted to our central probabilistic result, Theorem 1.4, and the accompanying notion of pseudospectral shattering which will be used extensively in our analysis of the spectral bisection algorithm in Sect. 5.

3.1 Smoothed Analysis of Gap and Eigenvector Condition Number

As is customary in the literature, we will refer to an \(n\times n\) random matrix \(G_n\) whose entries are independent complex Gaussians drawn from \(\mathcal {N}(0,1_\mathbb {C}/n)\) as a normalized complex Ginibre random matrix. To be absolutely clear, and because other choices of scaling are quite common, we mean that \(\mathbb {E}G_{i,j} = 0\) and \(\mathbb {E}|G_{i,j}|^2 = 1/n\).
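For illustration, a normalized complex Ginibre matrix under this convention can be sampled as follows (a sketch in numpy; not part of the paper's algorithms):

```python
import numpy as np

def ginibre(n, rng=None):
    """Sample a normalized complex Ginibre matrix: E[G_ij] = 0, E[|G_ij|^2] = 1/n."""
    rng = np.random.default_rng() if rng is None else rng
    # real and imaginary parts each have variance 1/(2n), so |G_ij|^2 has mean 1/n
    return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0 * n)
```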

In the course of proving Theorem 1.4, we will need to bound the probability that the second-smallest singular value of an arbitrary matrix with small Ginibre perturbation is atypically small. We begin with a well-known lower tail bound on the singular values of a Ginibre matrix alone.

Theorem 3.1

([61,  Theorem 1.2]) For \(G_n\) an \(n\times n\) normalized complex Ginibre matrix and for any \(\alpha \ge 0\) it holds that

$$\begin{aligned} \mathbb {P}\left[ \sigma _j(G_n) < \frac{\alpha (n-j+1)}{n} \right] \le \left( \sqrt{2e} \, \alpha \right) ^{2(n-j+1)^2}. \end{aligned}$$

As in earlier work of several of the authors [9], we can transfer this result to the case of a Ginibre perturbation via a remarkable coupling result of P. Śniady.

Theorem 3.2

(Śniady [58]) Let \(A_1\) and \(A_2\) be \(n \times n\) complex matrices such that \(\sigma _i(A_1) \le \sigma _i(A_2)\) for all \(1 \le i \le n\). Assume further that \(\sigma _i(A_1) \ne \sigma _j(A_1)\) and \(\sigma _i(A_2) \ne \sigma _j(A_2)\) for all \(i \ne j\). Then for every \(t \ge 0\), there exists a joint distribution on pairs of \(n \times n\) complex matrices \((G_1, G_2)\) such that

  1.

    the marginals \(G_1\) and \(G_2\) are distributed as normalized complex Ginibre matrices, and

  2.

    almost surely \(\sigma _i(A_1 + \sqrt{t} G_1) \le \sigma _i(A_2 + \sqrt{t} G_2)\) for every i.

Corollary 3.3

For any fixed matrix M and parameters \(\gamma , t>0\)

$$\begin{aligned} \mathbb {P}[\sigma _{n-1}(M+\gamma G_n)<t]\le (e/2)^4 (tn/\gamma )^8 \le 4(tn/\gamma )^8. \end{aligned}$$

Proof

We would like to apply Theorem 3.2 to \(A_1=0\) and \(A_2=M\), but the theorem has the technical condition that \(A_1\) and \(A_2\) have distinct singular values. Taking vanishingly small perturbations of 0 and M satisfying this condition and taking the size of the perturbation to zero, we obtain

$$\begin{aligned} \mathbb {P}[\sigma _{n-1}(M+\gamma G_n)<t] \le \mathbb {P}[\sigma _{n-1}(\gamma G_n)<t] = \mathbb {P}[\sigma _{n-1}(G_n)<t/\gamma ]. \end{aligned}$$

Invoking Theorem 3.1 with \(j=n-1\) and \(\alpha \) replaced by \(tn/2\gamma \) yields the claim.

\(\square \)
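As a sanity check, the bound of Corollary 3.3 can be compared against simulation. The following sketch (with arbitrary choices of M and of the parameters, ours not the paper's) estimates the left-hand side by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, t, trials = 50, 0.1, 1e-3, 2000
M = np.diag(np.ones(n - 1), k=1)          # an arbitrary, badly non-normal choice of M
hits = 0
for _ in range(trials):
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
    s = np.linalg.svd(M + gamma * G, compute_uv=False)
    hits += s[-2] < t                     # sigma_{n-1}: second-smallest singular value
print("empirical:", hits / trials, "  bound:", 4 * (t * n / gamma) ** 8)
```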

We will need as well the main theorem of [9], which shows that the addition of a small complex Ginibre to an arbitrary matrix tames its eigenvalue condition numbers.

Theorem 3.4

([9,  Theorem 1.5]) Suppose \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and \(\delta \in (0,1)\). Let \(G_n\) be a complex Ginibre matrix, and let \(\lambda _1,\ldots ,\lambda _n\in \mathbb {C}\) be the (random) eigenvalues of \(A+\delta G_n\). Then for every measurable open set \(B\subset \mathbb {C},\)

$$\begin{aligned} \mathbb {E}\sum _{\lambda _i \in B} \kappa (\lambda _i)^2 \le \frac{n^2}{\pi \delta ^2}\mathrm {vol}(B). \end{aligned}$$

Our final lemma before embarking on the proof in earnest shows that bounds on the j-th smallest singular value and eigenvector condition number are sufficient to rule out the presence of j eigenvalues in a small region. For our particular application, we will take \(j=2\).

Lemma 3.5

Let \(D(z_0,r) :=\{z \in \mathbb {C} :|z-z_0|<r\}\). If \(M\in \mathbb {C}^{n\times n}\) is a diagonalizable matrix with at least j eigenvalues in \(D(z_0,r)\) then

$$\begin{aligned} \sigma _{n-j+1}(z_0-M) \le r\kappa _V(M). \end{aligned}$$

Proof

Write \(M=VDV^{-1}\). By Courant–Fischer:

$$\begin{aligned} \sigma _{n-j+1}(z_0-M)&= \min _{S:\dim (S)=j}\max _{x\in S\setminus \{0\}} \frac{\Vert V(z_0-D)V^{-1}x\Vert }{\Vert x\Vert }&\\&=\min _{S:\dim (S)=j}\max _{y\in V(S)\setminus \{0\}} \frac{\Vert V(z_0-D)y\Vert }{\Vert Vy\Vert }&\text { setting }y=Vx\\&=\min _{S:\dim (S)=j}\max _{y\in S\setminus \{0\}} \frac{\Vert V(z_0-D)y\Vert }{\Vert Vy\Vert }&\text { since }V \text { is invertible}\\&\le \min _{S:\dim (S)=j}\max _{y\in S\setminus \{0\}} \frac{\Vert V\Vert \Vert (z_0-D)y\Vert }{\sigma _n(V)\Vert y\Vert }&\\&\le \kappa _V(M)\sigma _{n-j+1}(z_0-D).&\end{aligned}$$

Since \(z_0-D\) is diagonal its singular values are just \(|z_0-\lambda _i|\), so the j-th smallest is at most r, finishing the proof. \(\square \)

We now present the main tail bound that we use to control the minimum gap and eigenvector condition number.

Theorem 3.6

(Multiparameter Tail Bound) Let \(A\in \mathbb {C}^{n\times n}\). Assume \(\Vert A\Vert \le 1\) and \(\gamma <1/2\), and let \(X:=A+\gamma G_n\) where \(G_n\) is a complex Ginibre matrix. For every \(t,r>0\):

$$\begin{aligned}&\mathbb {P}\big [\kappa _V(X)<t,\, \mathrm {gap}(X)>r, \, \Vert G_n\Vert <4\big ]\ge 1\nonumber \\&\quad -\left( \frac{144}{r^2}\cdot 4(trn/\gamma )^8 + (9n^3/\gamma ^2t^2)+2e^{-2 n}\right) . \end{aligned}$$
(18)

Proof

Write \(\Lambda (X):=\{\lambda _1,\ldots ,\lambda _n\}\) for the (random) eigenvalues of \(X:=A+\gamma G_n\), in increasing order of magnitude (there are no ties almost surely). Let \(\mathcal {N}\subset \mathbb {C}\) be a minimal r/2-net of \(B:=D(0,3)\), recalling the standard fact that one exists of size no more than \((3\cdot 4/r)^2=144/r^2\). The most useful feature of such a net is that, by the triangle inequality, for any \(a,b \in D(0,3)\) with distance at most r, there is a point \(y\in \mathcal {N}\) with \(|y-(a+b)/2|<r/2\) satisfying \(a,b\in D(y,r)\). In particular, if \(\mathrm {gap}(X) < r\), then there are two eigenvalues in the disk of radius r centered at some point \(y \in \mathcal {N}\).

Therefore, consider the events

$$\begin{aligned} E_\mathrm {gap}&:= \{\mathrm {gap}(X)<r\} \subset \{\exists y\in \mathcal {N}: |D(y,r)\cap \Lambda (X)|\ge 2\} \\ E_D&:= \{\Lambda (X)\not \subseteq D(0,3)\}\subset \{\Vert G_n\Vert \ge 4\}:=E_G \\ E_\kappa&:= \{\kappa _V(X) > t\} \\ E_y&:=\{\sigma _{n-1}(y-X) < rt\},\qquad y\in \mathcal {N}. \end{aligned}$$

Lemma 3.5 applied to each \(y\in \mathcal {N}\) with \(j=2\) reveals that

$$\begin{aligned} E_\mathrm {gap}\subseteq E_D\cup E_\kappa \cup \bigcup _{y\in \mathcal {N}} E_y, \end{aligned}$$

whence

$$\begin{aligned} E_\mathrm {gap}\cup E_\kappa \subseteq E_D\cup E_\kappa \cup \bigcup _{y\in \mathcal {N}} E_y. \end{aligned}$$

By a union bound, we have

$$\begin{aligned} \mathbb {P}[E_\mathrm {gap}\cup E_\kappa ] \le \mathbb {P}[E_D\cup E_\kappa ]+|\mathcal {N}|\max _{y\in \mathcal {N}} \mathbb {P}[E_y]. \end{aligned}$$
(19)

From the tail bound on the operator norm of a Ginibre matrix in [9,  Lemma 2.2],

$$\begin{aligned} \mathbb {P}[E_D] \le \mathbb {P}[E_G]\le 2e^{-(4-2\sqrt{2})^2n}\le 2e^{-2 n}. \end{aligned}$$
(20)

Observe that by (11),

$$\begin{aligned} \left\{ \kappa _V(X) > \sqrt{n\sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2}\right\} \subset E_D, \end{aligned}$$

since the inequality in the left-hand event must reverse when we sum over all \(\lambda _i \in \Lambda (X)\); thus,

$$\begin{aligned} E_\kappa \subset E_D \cup \left\{ \sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 > t^2/n\right\} . \end{aligned}$$

Theorem 3.4 and Markov’s inequality yield

$$\begin{aligned} \mathbb {P}\left[ \sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 > t^2/n\right] \le \mathbb {E}\sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 \frac{n}{t^2} \le \frac{9\pi n^2}{\pi \gamma ^2} \frac{n}{t^2} = \frac{9n^3}{t^2\gamma ^2}. \end{aligned}$$

Thus, we have

$$\begin{aligned} \mathbb {P}[E_\kappa \cup E_D]\le \frac{9n^3}{t^2\gamma ^2} + 2e^{-2 n}. \end{aligned}$$

Corollary 3.3 applied to \(M=-y+A\) gives the bound

$$\begin{aligned} \mathbb {P}[E_y]\le 4\left( \frac{trn}{\gamma }\right) ^8, \end{aligned}$$

for each \(y\in \mathcal {N}\), and plugging these estimates back into (19) we have

$$\begin{aligned} \mathbb {P}[E_\mathrm {gap}\cup E_\kappa \cup E_D] \le \mathbb {P}[E_\mathrm {gap}\cup E_\kappa \cup E_G]\le \frac{144}{r^2}\cdot 4\left( \frac{trn}{\gamma }\right) ^8 + \frac{9n^3}{\gamma ^2t^2}+2e^{-2 n}, \end{aligned}$$

as desired. \(\square \)

A specific setting of parameters in Theorem 3.6 immediately yields Theorem 1.4.

Proof of Theorem 1.4

Applying Theorem 3.6 with parameters \( t:=\frac{n^2}{\gamma }\) and \(r := \frac{\gamma ^4}{n^5}\), we have

$$\begin{aligned} \mathbb {P}\big [\mathrm {gap}(X)>r,\, \kappa _V(X)<t,\, \Vert G\Vert \le 4\big ]\ge & {} 1-600 \frac{n^{10}}{\gamma ^8}\left( \frac{\gamma ^2}{n^2}\right) ^8 \nonumber \\&- \frac{9}{n}-2e^{-2 n}\ge 1-12/n, \end{aligned}$$
(21)

as desired, where in the last step we use the assumption \(\gamma < 1/2\). \(\square \)
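The regularization phenomenon behind Theorem 1.4 is easy to observe numerically. The sketch below (illustrative; the Jordan block and the parameters are our ad hoc choices) computes the minimum gap of a perturbed Jordan block together with an upper bound on its eigenvector condition number:

```python
import numpy as np

def min_gap(eigs):
    d = np.abs(eigs[:, None] - eigs[None, :])
    return d[~np.eye(len(eigs), dtype=bool)].min()

def kappa_V_upper(A):
    # cond(V) for the eigenvector matrix returned by numpy; kappa_V(A) is the
    # infimum over all diagonalizing V, so this is only an upper bound on it
    _, V = np.linalg.eig(A)
    return np.linalg.cond(V)

rng = np.random.default_rng(1)
n, gamma = 40, 1e-4
A = np.diag(np.ones(n - 1), k=1)          # Jordan block: not diagonalizable
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G
print("gap(X) =", min_gap(np.linalg.eigvals(X)), "  kappa_V(X) <=", kappa_V_upper(X))
```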

Since it is of independent interest in random matrix theory, we record the best bound on the gap alone that is possible to extract from the theorem above.

Corollary 3.7

(Minimum Gap Bound)

For X as in Theorem 3.6,

$$\begin{aligned} \mathbb {P}[\mathrm {gap}(X)<r]\le 2\cdot 9^{4/5}(144\cdot 4)^{1/5}(n/\gamma )^{2+6/5}r^{6/5} + 2e^{-2 n}\le 42(n/\gamma )^{3.2}r^{1.2}+ 2e^{-2 n}. \end{aligned}$$

In particular, the probability is o(1) if \(r=o((\gamma /n)^{8/3})\).

Proof

Setting

$$\begin{aligned} t^{10} = \frac{9}{144\cdot 4}(\gamma /nr)^6 \end{aligned}$$

in Theorem 3.6 balances the first two terms and yields the advertised bound. \(\square \)

3.2 Shattering

Propositions 2.2 and 2.3 in the preliminaries together tell us that if the \(\epsilon \)-pseudospectrum of an \(n\times n\) matrix A has n connected components, then each eigenvalue of any size-\(\epsilon \) perturbation \({\widetilde{A}}\) will lie in its own connected component of \(\Lambda _\epsilon (A)\). The following key definitions make this phenomenon quantitative in a sense which is useful for our analysis of spectral bisection.

Definition 3.8

(Grid) A grid in the complex plane consists of the boundaries of a lattice of squares with lower edges parallel to the real axis. We will write

$$\begin{aligned} \mathsf {grid}(z_0,\omega ,s_1,s_2) \subset \mathbb {C}\end{aligned}$$

to denote an \(s_1\times s_2\) grid of \(\omega \times \omega \) squares with lower left corner at \(z_0 \in \mathbb {C}\). Write \({{\,\mathrm{diam}\,}}(\mathsf {g}) := \omega \sqrt{s_1^2 + s_2^2}\) for the diameter of the grid.

Definition 3.9

(Shattering) A pseudospectrum \(\Lambda _\epsilon (A)\) is shattered with respect to a grid \(\mathsf {g}\) if:

  1.

    Every square of \(\mathsf {g}\) has at most one eigenvalue of A.

  2.

    \(\Lambda _\epsilon (A)\cap \mathsf {g}=\emptyset \).

Observation 3.10

As \(\Lambda _\epsilon (A)\) contains a ball of radius \(\epsilon \) about each eigenvalue of A, shattering of the \(\epsilon \)-pseudospectrum with respect to a grid with side length \(\omega \) implies \(\epsilon \le \omega /2\).

As a warm-up for more sophisticated arguments later on, we give here an easy consequence of the shattering property.

Lemma 3.11

If \(\lambda _1, \dots , \lambda _n\) are the eigenvalues of A, and \(\Lambda _\epsilon (A)\) is shattered with respect to a grid \(\mathsf {g}\) with side length \(\omega \), then every eigenvalue condition number satisfies \(\kappa (\lambda _i) \le \frac{2\omega }{\pi \epsilon }\).

Proof

Let \(v,w^*\) be a right/left eigenvector pair for some eigenvalue \(\lambda _i\) of A, normalized so that \(w^*v = 1\). Letting \(\Gamma \) be the positively oriented boundary of the square of \(\mathsf {g}\) containing \(\lambda _i\), we can extract the projector \(vw^*\) by integrating, and pass norms inside the contour integral to obtain

$$\begin{aligned} \kappa (\lambda _i)&= \Vert vw^*\Vert = \left\| \frac{1}{2\pi i}\oint _\Gamma (z - A)^{-1}d z \right\| \le \frac{1}{2\pi }\oint _\Gamma \left\| (z - A)^{-1}\right\| d z \le \frac{2\omega }{\pi \epsilon }.\nonumber \\ \end{aligned}$$
(22)

In the final step, we have used the fact that, given the definition of pseudospectrum (6) above, \(\Lambda _\epsilon (A) \cap \mathsf {g}= \emptyset \) means \(\Vert (z - A)^{-1}\Vert \le 1/\epsilon \) on \(\mathsf {g}\). \(\square \)
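The quantities \(\kappa (\lambda _i)\) appearing in this lemma can be computed directly from an eigendecomposition, as in the following sketch (illustrative, assuming a diagonalizable input):

```python
import numpy as np

def eig_cond_numbers(A):
    """kappa(lambda_i) = ||v_i w_i^*|| = ||v_i|| * ||w_i|| with w_i^* v_i = 1."""
    lam, V = np.linalg.eig(A)
    W = np.linalg.inv(V).conj().T      # columns of W are the left eigenvectors w_i
    # rows of inv(V) are w_i^* with w_i^* v_j = delta_ij, so the normalization
    # w_i^* v_i = 1 holds automatically
    return lam, np.linalg.norm(V, axis=0) * np.linalg.norm(W, axis=0)
```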

The theorem below quantifies the extent to which perturbing by a Ginibre matrix results in a shattered pseudospectrum. See Fig. 1 for an illustration in the case where the initial matrix is poorly conditioned. In general, not all eigenvalues need move so far upon such a perturbation, in particular if the respective \(\kappa _i\) are small.

Fig. 1

T is a sample of an upper triangular \(10\times 10\) Toeplitz matrix with zeros on the diagonal and an independent standard real Gaussian repeated along each diagonal above the main diagonal. G is a sample of a \(10\times 10\) complex Ginibre matrix with unit variance entries. Using the MATLAB package EigTool [67], the boundaries of the \(\epsilon \)-pseudospectrum of T (left) and \(T+10^{-6} G\) (right) for \(\epsilon = 10^{-6}\) are plotted along with the spectra. The latter pseudospectrum is shattered with respect to the pictured grid
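EigTool is a MATLAB package; for readers without it, the Fig. 1 experiment can be roughly re-created in numpy by evaluating \(\sigma _{\min }(z-X)\) on a mesh, the standard (if slow) way to draw pseudospectra. The following sketch is ours and only approximates the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, gamma = 10, 1e-6, 1e-6
c = rng.standard_normal(n)                      # one Gaussian per superdiagonal
T = sum(c[k] * np.eye(n, k=k) for k in range(1, n))
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
X = T + gamma * G                               # unit-variance Ginibre, as in the caption

xs = ys = np.linspace(-2, 2, 120)
smin = np.array([[np.linalg.svd((x + 1j * y) * np.eye(n) - X,
                                compute_uv=False)[-1] for x in xs] for y in ys])
# mesh points with smin < eps approximate Lambda_eps(X); the level set at eps
# traces the pseudospectral boundaries plotted in Fig. 1
```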

Theorem 3.12

(Exact Arithmetic Shattering) Let \(A\in {\mathbb {C}}^{n\times n}\) and \(X:=A+\gamma G_n\) for \(G_n\) a complex Ginibre matrix. Assume \(\Vert A\Vert \le 1\) and \(0< \gamma < 1/2\). Let \(\mathsf {g}:= \mathsf {grid}(z, \omega ,\lceil 8/\omega \rceil , \lceil 8/\omega \rceil )\) with \(\omega := \frac{\gamma ^4}{4n^5}\), and z chosen uniformly at random from the square of side \(\omega \) cornered at \(-4-4i\). Then, \(\kappa _V(X)\le n^2/\gamma \), \(\Vert A-X\Vert \le 4\gamma \), and \(\Lambda _\epsilon (X)\) is shattered with respect to \(\mathsf {g}\) for

$$\begin{aligned} \epsilon := \frac{\gamma ^5}{16n^9}, \end{aligned}$$

with probability at least \(1-13/n\).

Proof

Condition on the event in Theorem 1.4, so that

$$\begin{aligned} \kappa _V(X)\le \frac{n^2}{\gamma },\quad \Vert X-A\Vert \le 4\gamma ,\quad \text { and } \mathrm {gap}(X)\ge \frac{\gamma ^4}{n^5}=4\omega . \end{aligned}$$

Consider the random grid \(\mathsf {g}\). Since D(0, 3) is contained in the square of side length 8 centered at the origin, every eigenvalue of X is contained in one square of \(\mathsf {g}\) with probability 1. Moreover, since \(\mathrm {gap}(X)>4\omega \), no square can contain two eigenvalues. Let

$$\begin{aligned} \mathsf {dist}_\mathsf {g}(z):=\min _{y\in \mathsf {g}}|z-y|. \end{aligned}$$

Let \(\lambda _i := \lambda _i(X)\). We now have for each \(\lambda _i\) and every \(s < \frac{\omega }{2}\):

$$\begin{aligned} \mathbb {P}[\mathsf {dist}_\mathsf {g}(\lambda _i)>s] = \frac{(\omega -2s)^2}{\omega ^2} = 1- \frac{4 s}{\omega }+ \frac{4s^2 }{\omega ^2} \ge 1-\frac{4s}{\omega }, \end{aligned}$$

since the distribution of \(\lambda _i\) inside its square is uniform with respect to Lebesgue measure. Setting \(s=\omega /4n^2\), this probability is at least \(1-1/n^2\), so by a union bound

$$\begin{aligned} \mathbb {P}[\min _{i\le n} \mathsf {dist}_\mathsf {g}(\lambda _i)>\omega /4n^2]>1-1/n, \end{aligned}$$
(23)

i.e., every eigenvalue is well-separated from \(\mathsf {g}\) with probability \(1-1/n\).

We now recall from (12) that

$$\begin{aligned} \Lambda _\epsilon (X)\subset \bigcup _{i\le n} D(\lambda _i, \kappa _V(X)\epsilon ). \end{aligned}$$

Thus, on the events (21) and (23), we see that \(\Lambda _\epsilon (X)\) is shattered with respect to \(\mathsf {g}\) as long as

$$\begin{aligned} \kappa _V(X)\epsilon < \frac{\omega }{4n^2}, \end{aligned}$$

which is implied by

$$\begin{aligned} \epsilon < \frac{\gamma ^4}{4n^5}\cdot \frac{1}{4n^2} \cdot \frac{\gamma }{n^2} =\frac{\gamma ^5}{16n^9}. \end{aligned}$$

Thus, the advertised claim holds with probability at least

$$\begin{aligned} 1-\frac{1}{n}- \frac{12}{n} = 1 - \frac{13}{n} , \end{aligned}$$

as desired. \(\square \)

Finally, we show that the shattering property is retained when the Gaussian perturbation is added in finite precision rather than exactly. This also serves as a pedagogical warm-up for our presentation of more complicated algorithms later in the paper: we use E to represent an adversarial roundoff error (as in step 2), and for simplicity neglect roundoff error completely in computations whose size does not grow with n (such as steps 3 and 4, which set scalar parameters).

[Pseudocode for the algorithm \(\mathsf {SHATTER}\); not reproduced here. An informal sketch follows.]
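Since the original pseudocode is not reproduced above, here is an informal exact-arithmetic sketch of \(\mathsf {SHATTER}\) assembled from the parameter choices in Theorems 3.12 and 3.13 (step numbering and the roundoff model are omitted; this is our reconstruction, not the paper's verbatim pseudocode):

```python
import numpy as np

def shatter(A, gamma, rng=None):
    """Exact-arithmetic sketch of SHATTER per Theorem 3.12: returns (X, grid, eps)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
    X = A + gamma * G                          # in finite arithmetic, step 2 incurs E
    omega = gamma ** 4 / (4 * n ** 5)          # grid side length
    z0 = (-4 - 4j) + omega * (rng.random() + 1j * rng.random())
    s = int(np.ceil(8 / omega))                # an s x s grid covering D(0, 3)
    eps = gamma ** 5 / (16 * n ** 9)
    return X, (z0, omega, s, s), eps
```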

Theorem 3.13

(Finite Arithmetic Shattering) Assume there is a \(c_{\mathsf {N}}\)-stable Gaussian sampling algorithm \(\mathsf {N}\) satisfying the requirements of Definition 2.5. Then \(\mathsf {SHATTER}\) has the advertised guarantees as long as the machine precision satisfies

$$\begin{aligned} {\textbf {u }}\le \frac{1}{2}\frac{\gamma ^5}{16n^9}\cdot \frac{1}{(3+c_{\mathsf {N}})\sqrt{n}}, \end{aligned}$$
(24)

and runs in

$$\begin{aligned} n^2T_\mathsf {N}+ n^2=O(n^2) \end{aligned}$$

arithmetic operations.

Proof

The two sources of error in \(\mathsf {SHATTER}\) are:

  1.

    An additive error of operator norm at most \(n\cdot c_{\mathsf {N}}\cdot (1/\sqrt{n})\cdot {\textbf {u }}\le c_{\mathsf {N}}\sqrt{n}\cdot {\textbf {u }}\) from \(\mathsf {N}\), by Definition 2.5.

  2.

    An additive error of norm at most \(\sqrt{n}\cdot \Vert X\Vert \cdot {\textbf {u }}\le 3\sqrt{n}{\textbf {u }}\), with probability at least \(1-1/n\), from the roundoff E in step 2.

Thus, as long as the precision satisfies (24), we have

$$\begin{aligned} \Vert \mathsf {SHATTER}(A,\gamma )-\mathrm {shatter}(A,\gamma )\Vert \le \frac{1}{2}\frac{\gamma ^5}{16n^9}, \end{aligned}$$

where \(\mathrm {shatter}(\cdot )\) refers to the (exact arithmetic) outcome of Theorem 3.12. The correctness of \(\mathsf {SHATTER}\) now follows from Proposition 2.2. Its running time is bounded by

$$\begin{aligned} n^2T_\mathsf {N}+ n^2 \end{aligned}$$

arithmetic operations, as advertised. \(\square \)

4 Matrix Sign Function

The algorithmic centerpiece of this work is the analysis, in finite arithmetic, of a well-known iterative method for approximating the matrix sign function. Recall from Sect. 1 that if A is a matrix whose spectrum avoids the imaginary axis, then

$$\begin{aligned} \mathrm {sgn}(A) = P_+ - P_- \end{aligned}$$

where \(P_+\) and \(P_-\) are the spectral projectors corresponding to eigenvalues in the open right and left half-planes, respectively. The iterative algorithm we consider approximates the matrix sign function by repeated application to A of the function

$$\begin{aligned} g(z) := \frac{1}{2}(z + z^{-1}). \end{aligned}$$
(25)

This is simply Newton’s method to find a root of \(z^2 - 1\), but one can verify that the function g fixes the left and right half-planes, and thus we should expect it to push those eigenvalues in the former towards \(-1\), and those in the latter towards \(+1\).

We denote the specific finite-arithmetic implementation used in our algorithm by \(\mathsf {SGN}\); the pseudocode is provided below.

[Pseudocode for the algorithm \(\mathsf {SGN}\); not reproduced here. An informal sketch follows.]
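As with \(\mathsf {SHATTER}\), the original pseudocode is not reproduced above. The following informal sketch of \(\mathsf {SGN}\) is assembled from the surrounding discussion and the iteration count of Theorem 4.9, with the stable inversion \(\mathsf {INV}\) of Definition 2.7 modeled by numpy's inversion (our reconstruction, not the paper's verbatim pseudocode):

```python
import numpy as np

def sgn(A, alpha0, eps0, beta):
    """Run N Newton steps A <- (A + A^{-1})/2, with N as in Theorem 4.9."""
    lg = np.log2
    N = int(np.ceil(lg(1 / (1 - alpha0)) + 3 * lg(lg(1 / (1 - alpha0)))
                    + lg(lg(1 / (beta * eps0))) + 7.59))
    for _ in range(N):
        A = 0.5 * (A + np.linalg.inv(A))   # stands in for the INV primitive
    return A
```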

In Sect. 4.1 we briefly discuss the specific preliminaries that will be used throughout this section. In Sect. 4.2 we give a pseudospectral proof of the rapid global convergence of this iteration when implemented in exact arithmetic. In Sect. 4.3 we show that the proof provided in Sect. 4.2 is robust enough to handle the finite arithmetic case; a formal statement of this main result is the content of Theorem 4.9.

4.1 Circles of Apollonius

It has been known since antiquity that a circle in the plane may be described as the set of points with a fixed ratio of distances to two focal points. By fixing the focal points and varying the ratio in question, we get a family of circles named for the Greek geometer Apollonius of Perga. We will exploit several interesting properties enjoyed by these Circles of Apollonius in the analysis below.

More precisely, we analyze the Newton iteration map \(g\) in terms of the family of Apollonian circles whose foci are the points \(\pm 1 \in \mathbb {C}\). For the remainder of this section we will write \(m(z) = \tfrac{1 - z}{1 + z}\) for the Möbius transformation taking the right half-plane to the unit disk, and for each \(\alpha \in (0,1)\) we denote by

$$\begin{aligned} \mathsf {C}^{+}_{\alpha } = \left\{ z \in \mathbb {C}: |m(z)| \le \alpha \right\} , \quad \mathsf {C}^{-}_{\alpha } = \{ z \in {\mathbb {C}} : |m(z)|^{-1} \le \alpha \} \end{aligned}$$

the closed region in the right (respectively left) half-plane bounded by such a circle. Write \(\partial \mathsf {C}^{+}_{\alpha }\) and \(\partial \mathsf {C}^{-}_{\alpha }\) for their boundaries, and \(\mathsf {C}_{\alpha } = \mathsf {C}^{+}_{\alpha } \cup \mathsf {C}^{-}_{\alpha }\) for their union. See Fig. 2 for an illustration.

Fig. 2

Apollonian circles appearing in the analysis of the Newton iteration. Depicted are \(\partial \mathsf {C}^{+}_{\alpha ^{2^{k}}}\) for \(\alpha =0.8\) and \(k = 0, 1, 2, 3\), with smaller circles corresponding to larger k

The region \(\mathsf {C}^{+}_{\alpha }\) is a disk centered at \(\tfrac{1 + \alpha ^2}{1 - \alpha ^2} \in \mathbb {R}\), with radius \(\tfrac{2\alpha }{1-\alpha ^2}\), and whose intersection with the real line is the interval \((m(\alpha ),m(\alpha )^{-1})\); \(\mathsf {C}^{-}_{\alpha }\) can be obtained by reflecting \(\mathsf {C}^{+}_{\alpha }\) with respect to the imaginary axis. For \(\alpha> \beta > 0\), we will write

$$\begin{aligned} \mathsf {A}^{+}_{\alpha ,\beta } = \mathsf {C}^{+}_{\alpha } \setminus \mathsf {C}^{+}_{\beta } \end{aligned}$$

for the Apollonian annulus lying inside \(\mathsf {C}^{+}_{\alpha }\) and outside \(\mathsf {C}^{+}_{\beta }\); note that the circles are not concentric so this is not strictly speaking an annulus, and note also that in our notation this set does not include \(\partial \mathsf {C}^{+}_{\beta }\). In the same way define \(\mathsf {A}^{-}_{\alpha ,\beta }\) for the left half-plane and write \(\mathsf {A}_{\alpha ,\beta } = \mathsf {A}^{+}_{\alpha ,\beta } \cup \mathsf {A}^{-}_{\alpha ,\beta }\).
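The description of \(\mathsf {C}^{+}_{\alpha }\) as a disk is easy to confirm numerically; the following check (illustrative, not from the paper) verifies that \(|m(\cdot )|\) is constant on the circle with the stated center and radius:

```python
import numpy as np

m = lambda z: (1 - z) / (1 + z)
alpha = 0.8
center = (1 + alpha**2) / (1 - alpha**2)
radius = 2 * alpha / (1 - alpha**2)
boundary = center + radius * np.exp(1j * np.linspace(0, 2 * np.pi, 8))
print(np.abs(m(boundary)))                          # constant, equal to alpha
print((1 - alpha) / (1 + alpha), center - radius)   # both equal m(alpha)
```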

Observation 4.1

([53]) The Newton map \(g\) is a two-to-one map from \(\mathsf {C}^{+}_{\alpha }\) to \(\mathsf {C}^{+}_{\alpha ^2}\), and a two-to-one map from \(\mathsf {C}^{-}_{\alpha }\) to \(\mathsf {C}^{-}_{\alpha ^2}\).

Proof

This follows from the fact that for each z in the right half-plane,

$$\begin{aligned} |m(g(z))| = \left| \frac{1 - \tfrac{1}{2}(z + 1/z)}{1 + \tfrac{1}{2}(z + 1/z)}\right| = \left| \frac{(1-z)^2}{(z + 1)^2}\right| = |m(z)|^2 \end{aligned}$$

and similarly for the left half-plane. \(\square \)

It follows from Observation 4.1 that under repeated application of the Newton map g, any point in the right or left half-plane converges to \(+1\) or \(-1\), respectively.
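The scalar dynamics are easy to observe directly: \(|m(z)|\) squares at every step, so a scalar in the right half-plane converges to \(+1\) quadratically (an illustrative sketch):

```python
import numpy as np

g = lambda z: 0.5 * (z + 1 / z)    # scalar Newton map
m = lambda z: (1 - z) / (1 + z)

z = 0.3 + 2.0j                     # right half-plane, so sgn(z) = +1
for k in range(6):
    print(k, abs(z - 1), abs(m(z)))   # |m(z)| squares at every step
    z = g(z)
```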

4.2 Exact Arithmetic

In this section, we set \(A_0 := A\) and \(A_{k+1} := g(A_k)\) for all \(k \ge 0\). In the case of exact arithmetic, Observation 4.1 implies global convergence of the Newton iteration when A is diagonalizable. For the convenience of the reader, we provide this argument (due to [53]) below.

Proposition 4.2

Let A be a diagonalizable \(n \times n\) matrix and assume that \(\Lambda (A) \subset \mathsf {C}_{\alpha }\) for some \(\alpha \in (0,1)\). Then for every \(N \in {\mathbb {N}}\) we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le \frac{4\alpha ^{2^N}}{1-\alpha ^{2^{N+1}}} \cdot \kappa _V(A). \end{aligned}$$

Moreover, when A does not have eigenvalues on the imaginary axis the minimum \(\alpha \) for which \(\Lambda (A) \subset \mathsf {C}_{\alpha }\) is given by

$$\begin{aligned} \alpha ^2 = \max _{1 \le i \le n} \left\{ 1 - \frac{4|{\text {Re}}(\lambda _i(A))|}{|\lambda _i(A)+\mathrm {sgn}(\lambda _i(A))|^2} \right\} . \end{aligned}$$

Proof

Consider the spectral decomposition \(A = \sum _{i=1}^n \lambda _i v_i w_i^*,\) and denote by \(\lambda _i^{(N)}\) the eigenvalues of \(A_N\).

By Observation 4.1, we have that \(\Lambda (A_N) \subset \mathsf {C}_{\alpha ^{2^N}}\) and \(\mathrm {sgn}(\lambda _i) = \mathrm {sgn}(\lambda _i^{(N)})\). Moreover, \(A_N\) and \(\mathrm {sgn}(A)\) have the same eigenvectors. Hence

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le \left\| \sum _{{\text {Re}}(\lambda _i) > 0} (\lambda _i^{(N)}-1) v_i w_i^* \right\| + \left\| \sum _{{\text {Re}}(\lambda _i) < 0} (\lambda _i^{(N)}+1) v_i w_i^* \right\| . \end{aligned}$$
(26)

Now we will use that for any matrix X we have \(\Vert X \Vert \le \kappa _V(X) \mathsf {spr}(X)\), where \(\mathsf {spr}(X)\) denotes the spectral radius of X. Observe that the spectral radii of the two matrices appearing on the right-hand side of (26) are bounded by \(\max _{i} |\lambda _i^{(N)}- \mathrm {sgn}(\lambda _i^{(N)})|\), which in turn is bounded by the radius of the circle \(\mathsf {C}^{+}_{\alpha ^{2^N}}\), namely \(2\alpha ^{2^N}/(1-\alpha ^{2^{N+1}})\). On the other hand, the eigenvector condition number of these matrices is bounded by \(\kappa _V(A)\). This concludes the first part of the statement.

In order to compute \(\alpha \) note that if \(z = x+ i y\) with \( x > 0\), then

$$\begin{aligned} |m(z)|^2 = \frac{(1-x)^2+ y^2}{(1+x)^2+ y^2} = 1- \frac{4x}{(1+x)^2+y^2}, \end{aligned}$$

and analogously when \(x < 0\) and we evaluate \(|m(z)|^{-2}\). \(\square \)

The above analysis breaks down when one tries to prove the same statement in the framework of finite arithmetic. This is due to the fact that at each step of the iteration the roundoff error can make the eigenvector condition numbers of the \(A_k\) grow. In fact, since \(\kappa _V(A_k)\) is sensitive to infinitesimal perturbations whenever \(A_k\) has a multiple eigenvalue, it seems difficult to control it against adversarial perturbations as the iteration converges to \(\mathrm {sgn}(A)\) (which has very high multiplicity eigenvalues). A different approach, also due to [53], yields a proof of convergence in exact arithmetic even when A is not diagonalizable. However, that proof relies heavily on the fact that \(m(A_N)\) is an exact power of \(m(A_0)\), or more precisely, it requires the matrices \(A_k\) to have the same generalized eigenvectors, which is again not the case in the finite arithmetic setting.

Therefore, a robust version, tolerant to perturbations, of the above proof is needed. To this end, instead of simultaneously keeping track of the eigenvector condition number and the spectrum of the matrices \(A_k\), we will just show that for certain \(\epsilon _k > 0\), the \(\epsilon _k\)-pseudospectra of these matrices are contained in a certain shrinking region dependent on k. This invariant is inherently robust to perturbations smaller than \(\epsilon _k\), unaffected by clustering of eigenvalues due to convergence, and allows us to bound the accuracy and other quantities of interest via the functional calculus. For example, the following lemma shows how to obtain a bound on \(\Vert A_N- \mathrm {sgn}(A)\Vert \) solely using information from the pseudospectrum of \(A_N\).

Lemma 4.3

(Pseudospectral Error Bound) Let A be any \(n \times n\) matrix and let \(A_N\) be the Nth iterate of the Newton iteration under exact arithmetic. Assume that \(\epsilon _N > 0\) and \(\alpha _N \in (0, 1)\) satisfy \(\Lambda _{\epsilon _N}(A_N) \subset \mathsf {C}_{\alpha _N}\). Then, we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A)\Vert \le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N}. \end{aligned}$$
(27)

Proof

Note that \(\mathrm {sgn}(A) = \mathrm {sgn}(A_N)\). Using the functional calculus, we get

$$\begin{aligned}&\Vert A_N - \mathrm {sgn}(A_N) \Vert = \left\| \frac{1}{2\pi i} \oint _{\partial \mathsf {C}_{\alpha _N}} z(z-A_N)^{-1}\,dz \right. \\&\qquad \left. - \frac{1}{2\pi i}\left( \oint _{\partial \mathsf {C}^{+}_{\alpha _N}} (z-A_N)^{-1}\,dz - \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} (z-A_N)^{-1}\,dz\right) \right\| \\&\quad = \left\| \frac{1}{2\pi i}\oint _{\partial \mathsf {C}^{+}_{\alpha _N}} z (z-A_N)^{-1} - (z-A_N)^{-1}\,dz \right. \\&\qquad \left. + \frac{1}{2\pi i} \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} z (z-A_N)^{-1} + (z-A_N)^{-1}\,dz \right\| \\&\quad \le \frac{1}{2\pi } \left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _N}} (z - 1) (z-A_N)^{-1} \,dz \right\| + \frac{1}{2\pi } \left\| \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} (z + 1) (z-A_N)^{-1} \,dz\right\| \\&\quad \le 2 \cdot \frac{1}{2\pi } \ell (\partial \mathsf {C}^{+}_{\alpha _{N}} ) \sup \{|z - 1| : z \in \mathsf {C}^{+}_{\alpha _N}\} \frac{1}{\epsilon _N} \\&\quad = \frac{ 4 \alpha _N}{1 - \alpha _N^2} \left( \frac{1+\alpha _N}{1-\alpha _N} - 1\right) \frac{1}{\epsilon _N} \\&\quad = \frac{ 8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N}. \end{aligned}$$

\(\square \)

In view of Lemma 4.3, we would now like to find sequences \(\alpha _k\) and \(\epsilon _k\) such that

$$\begin{aligned} \Lambda _{\epsilon _k}(A_k)\subset \mathsf {C}_{\alpha _k} \end{aligned}$$

and \(\alpha _k^2/\epsilon _k\) converges rapidly to zero. The dependence of this quantity on the square of \(\alpha _k\) turns out to be crucial. As we will see below, we can find such a sequence with \(\epsilon _k\) shrinking roughly at the same rate as \(\alpha _k\). This yields quadratic convergence, which will be necessary for our bound on the required machine precision in the finite arithmetic analysis of Sect. 4.3.

The lemma below is instrumental in determining the sequences \(\alpha _k, \epsilon _k\).

Lemma 4.4

(Key Lemma) If \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), then for every \(\alpha '>\alpha ^2\), we have \(\Lambda _{\epsilon '}(g(A))\subset \mathsf {C}_{\alpha '}\) where

$$\begin{aligned} \epsilon ' := \epsilon \, \frac{(\alpha ' - \alpha ^2)(1-\alpha ^2)}{8\alpha }. \end{aligned}$$

Proof

From the definition of pseudospectrum, our hypothesis implies \(\Vert (z - A)^{-1}\Vert < 1/\epsilon \) for every z outside of \(\mathsf {C}_{\alpha }\). The proof will hinge on the observation that, for each \(\alpha ' \in (\alpha ^2,\alpha )\), this resolvent bound allows us to bound the resolvent of \(g(A)\) everywhere in the Apollonian annulus \(\mathsf {A}_{\alpha ,\alpha '}\).

Let \(w \in \mathsf {A}_{\alpha ,\alpha '}\); see Fig. 3 for an illustration. We must show that \(w \not \in \Lambda _{\epsilon '}(g(A))\). Since \(w \not \in \mathsf {C}_{\alpha ^2}\), Observation 4.1 ensures no \(z \in \mathsf {C}_{\alpha }\) satisfies \(g(z) = w\); in other words, the function \((w - g(z))^{-1}\) is holomorphic in z on \(\mathsf {C}_{\alpha }\). As \(\Lambda (A) \subset \Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), Observation 4.1 also guarantees that \(\Lambda (g(A)) \subset \mathsf {C}_{\alpha ^2}\). Thus for w in the union of the two Apollonian annuli in question, we can calculate the resolvent of \(g(A)\) at w using the holomorphic functional calculus:

$$\begin{aligned} (w - g(A))^{-1} = \frac{1}{2\pi i}\oint _{\partial \mathsf {C}_{\alpha }} (w - g(z))^{-1}(z - A)^{-1}d z, \end{aligned}$$

where by this we mean to sum the integrals over \(\partial \mathsf {C}^{+}_{\alpha }\) and \(\partial \mathsf {C}^{-}_{\alpha }\), both positively oriented. Taking norms, passing inside the integral, and applying Observation 4.1 one final time, we get:

$$\begin{aligned} \left\| (w - g(A))^{-1} \right\|&\le \frac{1}{2\pi }\oint _{\partial \mathsf {C}_{\alpha }}|(w - g(z))^{-1}|\cdot \Vert (z - A)^{-1}\Vert d z\\&\le \frac{\ell \left( \partial \mathsf {C}^{+}_{\alpha }\right) \sup _{y \in \mathsf {C}^{+}_{\alpha ^2}}|(w - y)^{-1}| + \ell \left( \partial \mathsf {C}^{-}_{\alpha }\right) \sup _{y \in \mathsf {C}^{-}_{\alpha ^2}}|(w - y)^{-1}|}{2\pi \epsilon } \\&\le \frac{1}{\epsilon } \frac{8\alpha }{(\alpha ' - \alpha ^2)(1 - \alpha ^2)}. \end{aligned}$$

In the last step we also use the forthcoming Lemma 4.5. Thus, with \(\epsilon '\) defined as in the theorem statement, \(\mathsf {A}_{\alpha ,\alpha '}\) contains none of the \(\epsilon '\)-pseudospectrum of \(g(A)\). Since \(\Lambda (g(A)) \subset \mathsf {C}_{\alpha ^2}\), Proposition 2.3 tells us that there can be no \(\epsilon '\)-pseudospectrum in the remainder of \(\mathbb {C}\setminus \mathsf {C}_{\alpha '}\), as such a connected component would need to contain an eigenvalue of \(g(A)\). \(\square \)

Fig. 3

Illustration of the proof of Lemma 4.4

Lemma 4.5

Let \(1> \alpha , \beta > 0\) be given. Then for any \(x \in \partial \mathsf {C}_{\alpha }\) and \(y \in \partial \mathsf {C}_{\beta }\), we have \(|x-y| \ge |\alpha -\beta |/2\).

Proof

Without loss of generality \(x \in \partial \mathsf {C}^{+}_{\alpha }\) and \(y \in \partial \mathsf {C}^{+}_{\beta }\). Then, we have

$$\begin{aligned} |\alpha - \beta | = \left| |m(x)|- |m(y)|\right| \le |m(x)- m(y)| = \frac{2|x-y|}{{|1+x||1+y|}} \le 2|x-y|. \end{aligned}$$

\(\square \)

Lemma 4.4 will also be useful in bounding the condition numbers of the \(A_k\), which is necessary for the finite arithmetic analysis.

Corollary 4.6

(Condition Number Bound) Using the notation of Lemma 4.4, if \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), then

$$\begin{aligned} \Vert A^{-1}\Vert&\le \frac{1}{\epsilon } \quad \text {and} \quad \Vert A\Vert \le \frac{4\alpha }{(1 - \alpha )^2 \epsilon }. \end{aligned}$$

Proof

The bound \(\Vert A^{-1} \Vert \le 1/\epsilon \) follows from the fact that \(0 \notin \mathsf {C}_{\alpha } \supset \Lambda _{\epsilon }(A).\) In order to bound \(\Vert A\Vert \) we use the contour integral bound

$$\begin{aligned} \Vert A \Vert&= \left\| \frac{1}{2\pi i} \oint _{\partial \mathsf {C}_{\alpha }} z (z - A)^{-1}\,dz \right\| \\&\le \frac{\ell (\partial \mathsf {C}_{\alpha })}{2\pi } \left( \sup _{z \in \partial \mathsf {C}_{\alpha }} |z| \right) \frac{1}{\epsilon } \\&= \frac{4 \alpha }{1 - \alpha ^2} \frac{1+\alpha }{1-\alpha } \frac{1}{\epsilon }. \\ \end{aligned}$$

\(\square \)

Another direct application of Lemma 4.4 yields the following.

Lemma 4.7

Let \(\epsilon > 0\). If \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\) and \( 1/\alpha> D > 1\), then for every N we have the guarantee

$$\begin{aligned} \Lambda _{\epsilon _N}(A_N) \subset \mathsf {C}_{\alpha _N}, \end{aligned}$$

for \(\alpha _N =(D\alpha )^{2^N}/D\) and \(\epsilon _N = \frac{\alpha _N \epsilon }{\alpha } \left( \frac{(D-1)(1-\alpha ^2)}{8D}\right) ^N \).

Proof

Define recursively \(\alpha _0 = \alpha \), \(\epsilon _0 = \epsilon \), \(\alpha _{k+1} = D \alpha _k^2\) and \(\epsilon _{k+1}= \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _0^2).\) It is easy to see by induction that this definition is consistent with the definition of \(\alpha _N\) and \(\epsilon _N\) given in the statement.

We will now show by induction that \(\Lambda _{\epsilon _k}(A_k) \subset \mathsf {C}_{\alpha _k}\). Assume the statement is true for k, so from Lemma 4.4 we have that the statement is also true for \(A_{k+1}\) if we pick the pseudospectral parameter to be

$$\begin{aligned} \epsilon ' = \epsilon _k \frac{(\alpha _{k+1}-\alpha _k^2)(1-\alpha _k^2)}{8\alpha _k} = \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _k^2). \end{aligned}$$

On the other hand,

$$\begin{aligned} \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _k^2) \ge \frac{1}{8} \epsilon _k \alpha _k (D-1) (1-\alpha _0^2) = \epsilon _{k+1}, \end{aligned}$$

which concludes the proof of the statement. \(\square \)
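The behavior of the two sequences is worth seeing side by side: \(\alpha _k\) decays doubly exponentially while \(\epsilon _k\) loses only a geometric factor per step. The sketch below (with arbitrary sample parameters, ours not the paper's) simply evaluates the recurrences from the proof:

```python
def sequences(alpha, eps, D, N):
    """Evaluate alpha_{k+1} = D alpha_k^2, eps_{k+1} = eps_k alpha_k (D-1)(1-alpha_0^2)/8."""
    a, e = alpha, eps
    out = [(a, e)]
    for _ in range(N):
        e = e * a * (D - 1) * (1 - alpha ** 2) / 8   # uses the fixed 1 - alpha_0^2
        a = D * a ** 2
        out.append((a, e))
    return out

for a, e in sequences(alpha=0.9, eps=1e-3, D=1.05, N=5):
    print(f"alpha_k = {a:.3e}, eps_k = {e:.3e}")
```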

We are now ready to prove the main result of this section, a pseudospectral version of Proposition 4.2.

Proposition 4.8

Let \(A\in {\mathbb {C}}^{n\times n}\) be a diagonalizable matrix and assume that \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\) for some \(\alpha \in (0,1)\). Then, for any \(1< D < \frac{1}{\alpha }\) and every N, we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le (D\alpha )^{2^N}\cdot \frac{\alpha (1-\alpha ^2)^2}{8\epsilon }\cdot \left( \frac{8D}{(D-1)(1-\alpha ^2)}\right) ^{N+2}. \end{aligned}$$

Proof

Using the choice of \(\alpha _k\) and \(\epsilon _k\) given in the proof of Lemma 4.7 and the bound (27), we get that

$$\begin{aligned} \Vert A_N- \mathrm {sgn}(A)\Vert&\le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N} \\&= \frac{8 \alpha _0 \alpha _N }{\epsilon _0 (1- \alpha _N)^2 (1+\alpha _N) } \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\&= (D \alpha _0)^{2^N} \frac{8 D^2 \alpha _0}{(D-(D\alpha _0)^{2^{N}})^2(D+(D\alpha _0)^{2^N}) \epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\ {}&\le (D \alpha _0)^{2^N} \frac{8D^2 \alpha _0}{(D-1)^2 \epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\&= (D\alpha _0)^{2^N}\, \frac{\alpha _0 (1-\alpha _0^2)^2}{8\epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^{N+2}, \end{aligned}$$

where the last inequality was taken solely to make the expression more intuitive, since not much is lost by doing so. \(\square \)

4.3 Finite Arithmetic

Finally, we turn to the analysis of \(\mathsf {SGN}\) in finite arithmetic. By making the machine precision small enough, we can bound the effect of roundoff to ensure that the parameters \(\alpha _k\), \(\epsilon _k\) are not too far from what they would have been in the exact arithmetic analysis above. We will stop the iteration before any of the quantities involved become prohibitively small, so we will only need \(\mathrm {polylog}(1/(1-\alpha _0), 1/\epsilon _0, 1/\beta )\) bits of precision, where \(\beta \) is the accuracy parameter.

In exact arithmetic, recall that the Newton iteration is given by \(A_{k+1} = g(A_{k}) = \frac{1}{2} (A_k + A_k^{-1}).\) Here we will consider the finite arithmetic version \(\mathsf {G}\) of the Newton map \(g\), defined as \(\mathsf {G}(A) := g(A)+E_A\) where \(E_A\) is an adversarial perturbation coming from the roundoff error. Hence, the sequence of interest is given by \({\widetilde{A}}_0 := A\) and \({\widetilde{A}}_{k+1} := \mathsf {G}({\widetilde{A}}_k)\).

In this subsection we will prove the following theorem concerning the runtime and precision of \(\mathsf {SGN}\). Our assumptions on the size of the parameters \(\alpha _0, \beta , \mu _{\mathsf {INV}}(n)\) and \(c_\mathsf {INV}\) are in place only to simplify the analysis of constants; these assumptions are not required for the execution of the algorithm.

Theorem 4.9

(Main guarantees for \(\mathsf {SGN}\)) Assume \(\mathsf {INV}\) is a \((\mu _{\mathsf {INV}}(n), c_\mathsf {INV})\)-stable matrix inversion algorithm satisfying Definition 2.7. Let \(\epsilon _0\in (0,1), \beta \in (0,1/12)\), assume \(\mu _{\mathsf {INV}}(n) \ge 1\) and \(c_\mathsf {INV}\log n \ge 1\), and assume \(A = {\widetilde{A}}_0\) is a floating-point matrix with \(\epsilon _0\)-pseudospectrum contained in \(\mathsf {C}_{\alpha _0}\) where \(0< 1 - \alpha _0 < 1/100\). Run \(\mathsf {SGN}\) with

$$\begin{aligned} N = \lceil \lg (1/(1- \alpha _0)) + 3 \lg \lg (1/(1-\alpha _0)) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil \end{aligned}$$

iterations (as specified in the statement of the algorithm). Then \(\widetilde{A_N}=\mathsf {SGN}(A)\) satisfies the advertised accuracy guarantee

$$\begin{aligned} \Vert \widetilde{A_N} - \mathrm {sgn}(A) \Vert \le \beta \end{aligned}$$

when run with machine precision satisfying

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_{\mathsf {SGN}} := \frac{ \alpha _0^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}, \end{aligned}$$

corresponding to at most

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) = O( \log n \log ^3(1/(1-\alpha _0)) (\log (1/\beta )+ \log (1/\epsilon _0)))\end{aligned}$$

required bits of precision. The number of arithmetic operations is at most

$$\begin{aligned} N(4 n^2 + T_\mathsf {INV}(n)) . \end{aligned}$$
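To get a feel for these bounds, the following sketch evaluates N and \(\lg (1/{\textbf {u }}_{\mathsf {SGN}})\) for sample parameters; the values plugged in for \(c_\mathsf {INV}\) and \(\mu _{\mathsf {INV}}(n)\) are placeholders in the spirit of Theorem 2.10 (and \(\lg \) is used for the unspecified base of \(c_\mathsf {INV}\log n\)), not prescribed by the paper:

```python
import numpy as np

def sgn_parameters(n, alpha0, eps0, beta, c_inv=8, mu_inv=None):
    lg = np.log2
    mu_inv = float(n) if mu_inv is None else mu_inv   # crude placeholder choice
    N = int(np.ceil(lg(1 / (1 - alpha0)) + 3 * lg(lg(1 / (1 - alpha0)))
                    + lg(lg(1 / (beta * eps0))) + 7.59))
    # lg(1/u_SGN) for u_SGN = alpha0^(2^(N+1)(c_inv lg n + 3)) / (mu_inv sqrt(n) N)
    lg_u = (2.0 ** (N + 1)) * (c_inv * lg(n) + 3) * lg(1 / alpha0) \
           + lg(mu_inv) + 0.5 * lg(n) + lg(N)
    return N, lg_u   # iteration count and bits of precision

print(sgn_parameters(n=1000, alpha0=1 - 1e-3, eps0=1e-6, beta=1e-3))
```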

Later on, we will need to call \(\mathsf {SGN}\) on a matrix with shattered pseudospectrum; the lemma below calculates acceptable parameter settings for shattering so that the pseudospectrum is contained in the required pair of Apollonian circles, satisfying the hypothesis of Theorem 4.9.

Lemma 4.10

If A has \(\epsilon \)-pseudospectrum shattered with respect to a grid \(\mathsf {g}= \mathsf {grid}(z_0,\omega ,s_1,s_2)\) that includes the imaginary axis as a grid line, then one has \(\Lambda _{\epsilon _0}(A) \subseteq \mathsf {C}_{\alpha _0}\) where \(\epsilon _0 = \epsilon /2\) and

$$\begin{aligned} \alpha _0 = 1 - \frac{\epsilon }{{{\,\mathrm{diam}\,}}(\mathsf {g})^2}. \end{aligned}$$

In particular, if \(\epsilon \) is at least \(1/\mathrm {poly}(n)\) and \(\omega s_1\) and \(\omega s_2\) are at most \(\mathrm {poly}(n)\), then \(\epsilon _0\) and \(1-\alpha _0\) are also at least \(1/\mathrm {poly}(n)\).

Proof

First, because it is shattered, the \(\epsilon /2\)-pseudospectrum of A is at least distance \(\epsilon /2\) from \(\mathsf {g}\). Recycling the calculation from Proposition 4.2, it suffices to take

$$\begin{aligned} \alpha _0^2 = \max _{z \in \Lambda _{\epsilon /2}(A)}\left( 1 - \frac{4|{\text {Re}}z|}{|z + \mathrm {sgn}(z)|^2}\right) . \end{aligned}$$

From what we just observed about the pseudospectrum, we can take \(|{\text {Re}}z| \ge \epsilon /2\). To bound the denominator, we can use the crude bound that any two points inside the grid are at distance no more than \({{\,\mathrm{diam}\,}}(\mathsf {g})\). Finally, we use \(\sqrt{1 - x} \le 1 - x/2\) for any \(x\in (0,1)\). \(\square \)

The proof of Theorem 4.9 will proceed as in the exact arithmetic case, with the modification that \(\epsilon _k\) must be decreased by an additional factor after each iteration to account for roundoff. At each step, we set the machine precision \({\textbf {u }}\) small enough so that the \(\epsilon _k\) remain close to what they would be in exact arithmetic. For the analysis we will introduce an explicit auxiliary sequence \(e_k\) that lower bounds the \(\epsilon _k\), provided that \({\textbf {u }}\) is small enough.

Lemma 4.11

(One-step additive error) Assume the matrix inverse is computed by an algorithm \(\mathsf {INV}\) satisfying the guarantee in Definition 2.7. Then \(\mathsf {G}(A) = g(A) + E\) for some error matrix E with norm

$$\begin{aligned} \Vert E \Vert \le \left( \Vert A \Vert + \Vert A^{-1} \Vert + \mu _{\mathsf {INV}}(n) \kappa (A)^{c_\mathsf {INV}\log n}\Vert A^{-1}\Vert \right) 2 \sqrt{n} {\textbf {u }}.\end{aligned}$$
(28)

The proof of this lemma is deferred to “Appendix A”.

With the error bound for each step in hand, we now move to the analysis of the whole iteration. It will be convenient to define \(s := 1 - \alpha _0\), which should be thought of as a small parameter. As in the exact arithmetic case, for \(k \ge 1,\) we will recursively define decreasing sequences \(\alpha _k\) and \(\epsilon _k\) maintaining the property

$$\begin{aligned} \Lambda _{\epsilon _k}({\widetilde{A}}_k) \subset \mathsf {C}_{\alpha _k} \qquad \text { for all }k \ge 0 \end{aligned}$$
(29)

by induction as follows:

  1.

    The base case \(k=0\) holds because by assumption, \(\Lambda _{\epsilon _0}({\widetilde{A}}_0) \subset \mathsf {C}_{\alpha _0}\).

  2.

    Here we recursively define \(\alpha _{k+1}\). Set

    $$\begin{aligned} \alpha _{k+1} := (1 + s/4) \alpha _k^2. \end{aligned}$$

    In the notation of Sect. 4.2, this corresponds to setting \(D = 1+s/4\). This definition ensures that \(\alpha _k^2 \le \alpha _{k+1} \le \alpha _k\) for all k, and also gives us the bound \((1+s/4)\alpha _0 \le 1-s/2\). We also have the closed form

    $$\begin{aligned} \alpha _k = (1+s/4)^{2^k - 1} \alpha _0^{2^k}, \end{aligned}$$

    which implies the useful bound

    $$\begin{aligned} \alpha _k \le (1-s/2)^{2^k}. \end{aligned}$$
    (30)
  3.

    Here we recursively define \(\epsilon _{k+1}\). Combining Lemma 4.4, the recursive definition of \(\alpha _{k+1}\), and the fact that \(1 - \alpha _k^2 \ge 1 - \alpha _0^2 \ge 1 - \alpha _0 = s\), we find that \(\Lambda _{\epsilon '}\left( g({\widetilde{A}}_k)\right) \subset \mathsf {C}_{\alpha _{k+1}}\), where

    $$\begin{aligned} \epsilon ' = \epsilon _k \frac{\left( \alpha _{k+1} - \alpha _k^2\right) (1-\alpha _k^2)}{8\alpha _k} = \epsilon _k \frac{s\alpha _k(1-\alpha _k^2)}{32} \ge \epsilon _k\frac{ \alpha _k s^2}{32}. \end{aligned}$$

    Thus in particular

    $$\begin{aligned} \Lambda _{\epsilon _k\alpha _k s^2/32} \left( g({\widetilde{A}}_k)\right) \subset \mathsf {C}_{\alpha _{k+1}}. \end{aligned}$$

    Since \({\widetilde{A}}_{k+1} = \mathsf {G}({\widetilde{A}}_k) = g({\widetilde{A}}_k) + E_k\), for some error matrix \(E_k\) arising from roundoff, Proposition 2.2 ensures that if we set

    $$\begin{aligned} \epsilon _{k+1} := \epsilon _k\frac{ s^2 \alpha _k }{32} - \Vert E_k \Vert \end{aligned}$$
    (31)

    we will have \(\Lambda _{\epsilon _{k+1}}({\widetilde{A}}_{k+1}) \subset \mathsf {C}_{\alpha _{k+1}}, \) as desired.

We now need to show that the \(\epsilon _{k}\) do not decrease too fast as k increases. In view of (31), it will be helpful to set the machine precision small enough to guarantee that \(\Vert E_k \Vert \) is a small fraction of \(\epsilon _k\frac{ \alpha _k s^2}{32}\).

First, we need to control the quantities \(\Vert {\widetilde{A}}_k\Vert \), \(\Vert {\widetilde{A}}_k^{-1}\Vert \), and \(\kappa ({\widetilde{A}}_k) =\Vert {\widetilde{A}}_k\Vert \Vert {\widetilde{A}}_k^{-1}\Vert \) appearing in our upper bound (28) on \(\Vert E_k \Vert \) from Lemma 4.11, as functions of \(\epsilon _k\). By Corollary 4.6, we have

$$\begin{aligned} \Vert {\widetilde{A}}_k^{-1}\Vert&\le \frac{1}{\epsilon _k} \quad \text {and} \quad \Vert {\widetilde{A}}_k\Vert \le 4\frac{\alpha _k}{(1 - \alpha _k)^2\epsilon _k} \le \frac{4}{s^2 \epsilon _k}. \end{aligned}$$

Thus, we may write the coefficient of \({\textbf {u }}\) in the bound (28) as

$$\begin{aligned} K_{\epsilon _k} := \left[ \frac{4}{s^2\epsilon _k} + \frac{1}{\epsilon _k} + \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2\epsilon _k^2} \right) ^{c_\mathsf {INV}\log n}\frac{1}{\epsilon _k} \right] 2 \sqrt{n} \end{aligned}$$

so that Lemma 4.11 reads

$$\begin{aligned} \Vert {E_k}\Vert \le K_{\epsilon _k} {\textbf {u }}. \end{aligned}$$
(32)

Plugging this into the definition (31) of \(\epsilon _{k+1}\), we have

$$\begin{aligned} \epsilon _{k+1} \ge \epsilon _k \frac{s^2\alpha _k}{32} - K_{\epsilon _k} {\textbf {u }}. \end{aligned}$$
(33)

Now suppose we take \({\textbf {u }}\) small enough so that

$$\begin{aligned} K_{\epsilon _k} {\textbf {u }}\le \frac{1}{3} \epsilon _k \frac{s^2\alpha _k}{32}. \end{aligned}$$
(34)

For such \({\textbf {u }}\), we then have

$$\begin{aligned} \epsilon _{k+1} \ge \frac{2}{3}\epsilon _k \frac{s^2\alpha _k}{32} = \frac{1}{48} \epsilon _k s^2 \alpha _k, \end{aligned}$$
(35)

which implies

$$\begin{aligned} \Vert E_k \Vert \le \frac{1}{2} \epsilon _{k+1}; \end{aligned}$$
(36)

this bound is loose but sufficient for our purposes. Inductively, we now have the following bound on \(\epsilon _k\) in terms of \(\alpha _k\):

Lemma 4.12

(Preliminary lower bound on \(\epsilon _k\)) Let \(k \ge 0\), and for all \(0 \le i \le k-1\), assume \({\textbf {u }}\) satisfies the requirement (34):

$$\begin{aligned} K_{\epsilon _i}{\textbf {u }}\le \frac{1}{3} \epsilon _i \frac{s^2 \alpha _i}{32}. \end{aligned}$$

Then, we have

$$\begin{aligned} \epsilon _k \ge e_k := \epsilon _0 \left( \frac{s^2}{50}\right) ^k \alpha _k. \end{aligned}$$

In fact, it suffices to assume the hypothesis only for \(i=k-1\).

Proof

The last statement follows from the fact that \(\epsilon _i\) is decreasing in i and \(K_{\epsilon _i}\) is increasing in i.

Since (34) implies (35), we may apply (35) repeatedly to obtain

$$\begin{aligned} \epsilon _k&\ge \epsilon _0 (s^2/48)^k \prod _{i=0}^{k-1} \alpha _i&\\&= \epsilon _0 (s^2/48)^k (1+s/4)^{2^k - 1 - k}\alpha _0^{2^k-1}&{\text { by the definition of }\alpha _i}\\&= \epsilon _0 \left( \frac{s^2}{48(1+s/4)}\right) ^k \frac{\alpha _k}{\alpha _0}&\\&\ge \epsilon _0 \left( \frac{s^2}{50}\right) ^k \alpha _k.&\alpha _0 \le 1, s < 1/8 \end{aligned}$$

\(\square \)

We now show that the conclusion of Lemma 4.12 still holds if we replace \(\epsilon _i\) everywhere in the hypothesis by \(e_i\), which is an explicit function of \(\epsilon _0\) and \(\alpha _0\) defined in Lemma 4.12. Note that we do not know \(\epsilon _i \ge e_i\) a priori, so to avoid circularity we must use a short inductive argument.

Corollary 4.13

(Lower bound on \(\epsilon _k\) with explicit hypothesis) Let \(k \ge 0\), and for all \(0 \le i \le k-1\), assume \({\textbf {u }}\) satisfies

$$\begin{aligned} K_{e_i} {\textbf {u }}\le \frac{1}{3} e_i \frac{s^2 \alpha _i}{32} \end{aligned}$$
(37)

where \(e_i\) is defined in Lemma 4.12. Then, we have

$$\begin{aligned} \epsilon _k \ge e_k.\end{aligned}$$

In fact, it suffices to assume the hypothesis only for \(i=k-1\).

Proof

The last statement follows from the fact that \(e_i\) is decreasing in i and \(K_{e_i}\) is increasing in i.

Assuming the full hypothesis of this lemma, we prove \(\epsilon _i \ge e_i\) for \(0 \le i \le k\) by induction on i. For the base case, we have \(\epsilon _0 \ge e_0 = \epsilon _0 \alpha _0\).

For the inductive step, assume \(\epsilon _i \ge e_i\). Then as long as \(i \le k-1\), the hypothesis of this lemma implies

$$\begin{aligned} K_{\epsilon _i} {\textbf {u }}\le \frac{1}{3} \epsilon _i \frac{s^2 \alpha _i}{32}, \ \end{aligned}$$

so we may apply Lemma 4.12 to obtain \(\epsilon _{i+1} \ge e_{i+1}\), as desired. \(\square \)

Lemma 4.14

(Main accuracy bound) Suppose \({\textbf {u }}\) satisfies the requirement (34) for all \(0 \le k \le N\). Then

$$\begin{aligned} \Vert {\widetilde{A}}_N - \mathrm {sgn}(A) \Vert \le \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2} + \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N}. \end{aligned}$$
(38)

Proof

Since \(\mathrm {sgn}= \mathrm {sgn}\circ g\), for every k we have

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_k})\Vert&= \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(g(\widetilde{A_k})) \Vert = \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) \\&\quad - \mathrm {sgn}(\widetilde{A_{k+1}} - E_k) \Vert . \end{aligned}$$

From the holomorphic functional calculus we can rewrite \(\Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_{k+1}} - E_k) \Vert \) as the norm of a certain contour integral, which in turn can be bounded as follows:

$$\begin{aligned}&\frac{1}{2\pi }\left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} [(z-\widetilde{A_{k+1}})^{-1} - (z - (\widetilde{A_{k+1}} - E_k))^{-1} ]\, dz \right. \\&\qquad \left. -\oint _{\partial \mathsf {C}^{-}_{\alpha _{k+1}}} [(z-\widetilde{A_{k+1}})^{-1} - (z - (\widetilde{A_{k+1}} - E_k))^{-1} ]\, dz \right\| \\&\quad = \frac{1}{2\pi }\left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} [(z - (\widetilde{A_{k+1}} - E_k))^{-1}E_k(z-\widetilde{A_{k+1}})^{-1} ]\, dz \right. \\&\qquad \left. - \oint _{\partial \mathsf {C}^{-}_{\alpha _{k+1}}} [(z - (\widetilde{A_{k+1}} - E_k))^{-1}E_k(z-\widetilde{A_{k+1}})^{-1} ]\, dz \right\| \\&\quad \le \frac{1}{\pi } \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} \Vert (z - (\widetilde{A_{k+1}} - E_k))^{-1}\Vert \Vert E_k \Vert \Vert (z - \widetilde{A_{k+1}})^{-1} \Vert \,dz\\&\quad \le \frac{1}{\pi }\ell (\partial \mathsf {C}^{+}_{\alpha _{k+1}}) \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}} \\&\quad = \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}}, \end{aligned}$$

where we use the definition (6) of pseudospectrum and Proposition 2.2, together with the property (29). Ultimately, this chain of inequalities implies

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_{k}}) \Vert \le \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}}. \end{aligned}$$

Summing over all k and using the triangle inequality, we obtain

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_N}) - \mathrm {sgn}(\widetilde{A_0}) \Vert&\le \sum _{k=0}^{N-1} \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}} \\&\le \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2}, \end{aligned}$$

where in the last step we use \(\alpha _k \le 1\) and \(1 - \alpha _{k+1}^2 \ge s\), as well as (36).

By Lemma 4.3 (to be precise, by repeating the proof of that lemma with \(\widetilde{A_N}\) substituted for \(A_N\)), we have

$$\begin{aligned} \Vert \widetilde{A_N} - \mathrm {sgn}(\widetilde{A_N})\Vert&\le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N} \\&\le \frac{8 }{s^2} \alpha _N \frac{\alpha _N}{\epsilon _N} \\&\le \frac{8}{s^2} \alpha _N\frac{1}{\epsilon _0}\left( \frac{50}{s^2}\right) ^N \\&\le \frac{8}{s^2\epsilon _0} (1-s/2)^{2^N}\left( \frac{50}{s^2}\right) ^N \\&\le \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N}. \end{aligned}$$

where we use \(s < 1/2\) in the last step.

Combining the above with the triangle inequality, we obtain the desired bound.

\(\square \)

We would like to apply Lemma 4.14 to ensure \(\Vert \widetilde{A_N} - \mathrm {sgn}(A) \Vert \) is at most \(\beta \), the desired accuracy parameter. The upper bound (38) in Lemma 4.14 is the sum of two terms; we will make each term less than \(\beta /2\). The bound for the second term will yield a sufficient condition on the number of iterations N. Given that, the bound on the first term will then give a sufficient condition on the machine precision \({\textbf {u }}\). This will be the content of Lemmas 4.16 and 4.17.

We start with the second term. The following preliminary lemma will be useful:

Lemma 4.15

Let \(1/800> t > 0\) and \(1/2> c > 0\) be given. Then for

$$\begin{aligned} j \ge \lg (1/t) + 2 \lg \lg (1/t) + \lg \lg (1/c) + 1.62, \end{aligned}$$

we have

$$\begin{aligned} \frac{(1-t)^{2^j}}{t^{2j}} < c. \end{aligned}$$

The proof is deferred to “Appendix A”.

Lemma 4.16

(Bound on second term of (38)) Suppose we have

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62. \end{aligned}$$

Then

$$\begin{aligned} \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N} \le \beta /2. \end{aligned}$$

Proof

It is sufficient that

$$\begin{aligned} \frac{8 \cdot 64^N }{s^{2N+2}\epsilon _0}(1 - s/8)^{2^N} \le \beta /2. \end{aligned}$$

The result now follows from applying Lemma 4.15 with \(c = \beta s^2 \epsilon _0/16\) and \(t=s/8\). \(\square \)

Now we move to the first term in the bound of Lemma 4.14.

Lemma 4.17

(Bound on first term of (38))

Suppose

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62, \end{aligned}$$

and suppose the machine precision \({\textbf {u }}\) satisfies

$$\begin{aligned} {\textbf {u }}\le \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N} . \end{aligned}$$

Then we have

$$\begin{aligned} \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2} \le \beta /2. \end{aligned}$$

Proof

It suffices to show that for all \(0 \le k \le N-1\),

$$\begin{aligned} \Vert E_k \Vert \le \frac{ \beta \epsilon _{k+1}^2 s}{16N}. \end{aligned}$$

In view of (32), which says \(\Vert {E_k}\Vert \le K_{\epsilon _k} {\textbf {u }}\), it is sufficient to have for all \(0 \le k \le N-1\)

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{\epsilon _k}}\frac{ \beta \epsilon _{k+1}^2 s}{16N}. \end{aligned}$$
(39)

For this, we claim it is sufficient to have for all \(0 \le k \le N-1\)

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{e_k}}\frac{ \beta e_{k+1}^2 s}{16N}. \end{aligned}$$
(40)

Indeed, on the one hand, since \(\beta < 1/6\) and by the loose bound \(e_{k+1}< s \alpha _{k+1} < s \alpha _k\), we have that (40) implies \({\textbf {u }}\le \frac{1}{3K_{e_k}} \frac{ s^2 \alpha _k e_k}{32}\), which means that the assumption in Corollary 4.13 is satisfied. On the other hand, Corollary 4.13 yields \(e_k\le \epsilon _k\) for all \(0\le k \le N\), which in turn, combined with (40), gives (39) and concludes the proof.

We now show that (40) holds for all \(0\le k\le N-1\). Because \(1/K_{e_k}\) and \(e_k\) are decreasing in k, it is sufficient to have the single condition

$$\begin{aligned} {\textbf {u }}\le \frac{1}{K_{e_N}}\frac{ \beta e_{N}^2 s}{16N}. \end{aligned}$$

We continue the chain of sufficient conditions on \({\textbf {u }}\), where each line implies the line above:

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{e_N}}\frac{ \beta e_N^2 s}{16N} \\ {\textbf {u }}&\le \frac{1}{\left[ \frac{4}{s^2e_N} + \frac{1}{e_N} + \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2e_N^2} \right) ^{c_\mathsf {INV}\log n}\frac{1}{e_N} \right] 2 \sqrt{n}} \frac{ \beta e_N^2 s}{16N} \\ {\textbf {u }}&\le \frac{1}{6 \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2 e_N} \right) ^{c_\mathsf {INV}\log n + 1} 2\sqrt{n}}\frac{ \beta e_N^2 s}{16 N} \\ {\textbf {u }}&\le \frac{ \beta }{6 \cdot 2 \cdot 16 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{e_N s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

where we use the bound \(\frac{1}{e_N} \le \frac{4}{s^2 e_N^2}\) without much loss, and we also use our assumption \(\mu _{\mathsf {INV}}(n) \ge 1\) and \(c_\mathsf {INV}\log n \ge 1\) for simplicity.

Substituting the value of \(e_N\) as defined in Lemma 4.12, we get the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{\epsilon _0 (s^2/50)^N \alpha _N s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

Replacing \(\alpha _N\) by the smaller quantity \(\alpha _0^{2^N} = (1-s)^{2^N}\) and cleaning up the constants yields the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{\epsilon _0 (s^2/50)^N (1-s)^{2^N} s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

Now we finally will use our hypothesis on the size of N to simplify this expression. Applying Lemma 4.16, we have

$$\begin{aligned} \epsilon _0 (s^2/50)^N /4 \ge \frac{4 (1-s)^{2^N}}{s^2 \beta }. \end{aligned}$$

Thus, our sufficient condition becomes

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{4(1-s)^{2^{N+1}}}{\beta } \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

To make the expression simpler, since \(c_\mathsf {INV}\log n + 3 \ge 4\) we may pull out a factor of \(4^4 > 192\) and remove the occurrences of \(\beta \) to yield the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}. \end{aligned}$$

\(\square \)

Matching the statement of Theorem 4.9, we give a slightly cleaner sufficient condition on N that implies the hypothesis on N appearing in the above lemmas. The proof is deferred to “Appendix A”.

Lemma 4.18

(Final sufficient condition on N) If

$$\begin{aligned} N = \lceil \lg (1/s) + 3 \lg \lg (1/s) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil , \end{aligned}$$

then

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62. \end{aligned}$$

Taking the logarithm of the machine precision yields the number of bits required:

Lemma 4.19

(Bit length computation) Suppose

$$\begin{aligned} N = \lceil \lg (1/s) + 3 \lg \lg (1/s) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil \end{aligned}$$

and

$$\begin{aligned} {\textbf {u }}_{\mathsf {SGN}} = \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}. \end{aligned}$$

Then

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) = O\big (\log n \log (1/s)^3 (\log (1/\beta ) + \log (1/\epsilon _0))\big ). \end{aligned}$$

Proof

In the course of the proof, for convenience we also record a nonasymptotic bound (for \(s<1/100\), \(\beta < 1/12\), \(\epsilon _0 < 1\) and \(c_\mathsf {INV}\log n > 1\) as in the hypothesis of Theorem 4.9), at the cost of making the computation somewhat messier.

Immediately we have

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) \le \lg \mu _{\mathsf {INV}}(n) + \frac{1}{2}\lg n + \lg N + (c_\mathsf {INV}\log n + 3) 2^{N+1} \log (1/(1-s)). \end{aligned}$$

Note that \(\log (1/(1-s)) < s\) for \(s < 1/2\). Also, \(2^{N+1} \le (1/s) \lg (1/s)^3 (\lg (1/\beta ) + \lg (1/\epsilon _0))2^{9.59}.\) Putting this together, we have

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}})&\le \lg \mu _{\mathsf {INV}}(n) + \frac{1}{2}\lg n + \lg N + 1000 (c_\mathsf {INV}\log n + 3)\lg (1/s)^3 (\lg (1/\beta ) \\&\quad + \lg (1/\epsilon _0)). \end{aligned}$$

We now crudely bound \(\lg N\). Note that for \(s < 1/100\) we have \(\lg (1/s) + 3 \lg \lg (1/s) + 7.59 \le 1/s\). Thus,

$$\begin{aligned} \lg N&\le \lg (1/s + \lg \lg (1/(\beta \epsilon _0)))&\\&\le \lg (1/s + \lg (1/(\beta \epsilon _0)))&\\&\le \lg (1/s) + \lg \lg (1/(\beta \epsilon _0))&\lg (a+b) \le \lg a + \lg b \text { for } a,b>2\\&\le \lg (1/s)^3 \lg (1/(\beta \epsilon _0)).&\end{aligned}$$

Combining the above, we may fold the \(\lg N\) and \(\lg n\) terms into the final term to obtain

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) \le \lg \mu _{\mathsf {INV}}(n) + 5000c_\mathsf {INV}\log n \lg (1/s)^3 (\lg (1/\beta ) + \lg (1/\epsilon _0)) \end{aligned}$$
(41)

where we use that \(c_\mathsf {INV}\log n > 1\) and therefore \(c_\mathsf {INV}\log n + 3 < 4 c_\mathsf {INV}\log n.\)

Using that \(\mu _{\mathsf {INV}}(n) = \mathrm {poly}(n)\) and discarding subdominant terms, we obtain the desired asymptotic bound. \(\square \)

This completes the proof of Theorem 4.9. Finally, we may prove the theorem advertised in Sect. 1.

Proof of Theorem 1.5

Set \(\epsilon := \min \{ \frac{1}{K}, 1\}\). Then \(\Lambda _\epsilon (A)\) does not intersect the imaginary axis, and furthermore \(\Lambda _\epsilon (A) \subseteq D(0, 2)\) because \(\Vert A \Vert \le 1\). Thus, we may apply Lemma 4.10 with \({{\,\mathrm{diam}\,}}(\mathsf {g}) = 4\sqrt{2}\) to obtain parameters \(\alpha _0, \epsilon _0\) with the property that \(\log (1/(1-\alpha _0))\) and \(\log (1/\epsilon _0)\) are both \(O(\log K)\). Theorem 4.9 now yields the desired conclusion. \(\square \)

5 Spectral Bisection Algorithm

In this section we will prove Theorem 1.6. As discussed in Sect. 1, our algorithm is not new, and in its idealized form it reduces to the following two tasks:

Split:

Given an \(n\times n\) matrix A, find a partition of the spectrum into pieces of roughly equal size, and output spectral projectors \(P_{\pm }\) onto each of these pieces.

Deflate:

Given an \(n\times n\) rank-k projector P, output an \(n\times k\) matrix Q with orthonormal columns that span the range of P.

These routines in hand, on input A one can compute \(P_{\pm }\) and the corresponding \(Q_{\pm }\), and then find the eigenvectors and eigenvalues of \(A_{\pm } := Q_{\pm }^*A Q_{\pm }\). The observation below verifies that this recursion is sound.

Observation 5.1

The spectrum of A is exactly \(\Lambda (A_+) \sqcup \Lambda (A_-)\), and every eigenvector of A is of the form \(Q_{\pm }v\) for some eigenvector v of one of \(A_{\pm }\).
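In exact arithmetic, Observation 5.1 already yields a complete recursion. The sketch below (ours) illustrates it with exact-arithmetic stand-ins: a full eigendecomposition plays the role of Split (splitting along the median real part) and an ordinary QR factorization plays the role of Deflate.

```python
import numpy as np

def ideal_eig(A):
    """Idealized split/deflate recursion of Observation 5.1 (exact arithmetic).

    Stand-ins: spectral projectors come from a full eigendecomposition, and
    deflation is an ordinary QR factorization of the relevant eigenvectors.
    """
    n = A.shape[0]
    if n == 1:
        return [A[0, 0]]
    evals, W = np.linalg.eig(A)
    h = np.median(evals.real)          # a vertical splitting line
    mask = evals.real > h
    if mask.all() or not mask.any():   # cannot split further; stop here
        return list(evals)
    out = []
    for side in (mask, ~mask):
        # The range of the spectral projector onto this side is the span of
        # the corresponding eigenvectors; Q is an orthonormal basis for it.
        Q, _ = np.linalg.qr(W[:, side])
        out += ideal_eig(Q.conj().T @ A @ Q)   # recurse on the compression
    return out

A = np.random.default_rng(1).standard_normal((6, 6))
print(sorted(np.round(ideal_eig(A), 8), key=lambda z: (z.real, z.imag)))
print(sorted(np.round(np.linalg.eigvals(A), 8), key=lambda z: (z.real, z.imag)))
```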

The difficulty, of course, is that neither of these routines can be executed exactly: we will never have access to true projectors \(P_{\pm }\), nor to the actual orthogonal matrices \(Q_{\pm }\) whose columns span their range, and must instead make do with approximations. Because our algorithm is recursive and our matrices nonnormal, we must take care that the errors in the sub-instances \(A_{\pm }\) do not corrupt the eigenvectors and eigenvalues we are hoping to find. Additionally, the Newton iteration we will use to split the spectrum behaves poorly when an eigenvalue is close to the imaginary axis, and it is not clear how to find a splitting which is balanced.

Our tactic in resolving these issues will be to pass to our algorithms a matrix and a grid with respect to which its \(\epsilon \)-pseudospectrum is shattered. To find an approximate eigenvalue, then, one can settle for locating the grid square it lies in; containment in a grid square is robust to perturbations of size smaller than \(\epsilon \). The shattering property is robust to small perturbations, inherited by the subproblems we pass to, and—because the spectrum is quantifiably far from the grid lines—allows us to run the Newton iteration in the first place.

Let us now sketch the implementations and state carefully the guarantees for \(\mathsf {SPLIT}\) and \(\mathsf {DEFLATE}\); the analysis of these will be deferred to Appendices B and C. Our splitting algorithm is presented with a matrix A whose \(\epsilon \)-pseudospectrum is shattered with respect to a grid \(\mathsf {g}\). For any vertical grid line with real part h, \(\mathrm {Tr}\, \mathrm {sgn}(A-h)\) gives the difference between the number of eigenvalues lying to its right and to its left. As

$$\begin{aligned} |\mathrm {Tr}\,\mathsf {SGN}(A-h) - \mathrm {Tr}\, \mathrm {sgn}(A-h)| \le n\Vert \mathsf {SGN}(A-h) - \mathrm {sgn}(A-h)\Vert , \end{aligned}$$

we can determine these eigenvalue counts exactly by running \(\mathsf {SGN}\) to accuracy O(1/n) and rounding \(\mathrm {Tr}\, \mathsf {SGN}(A-h)\) to the nearest integer. We will show in “Appendix B” that, by mounting a binary search over horizontal and vertical lines of \(\mathsf {g}\), we will always arrive at a partition of the eigenvalues into two parts of size at least \(\max \{n/5,1\}\). Having found it, we run \(\mathsf {SGN}\) one final time at the desired precision to find the approximate spectral projectors.
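To make the counting step concrete, here is a small sketch (ours), with the Newton iteration standing in for \(\mathsf {SGN}\) and a fixed iteration count standing in for the O(1/n) accuracy requirement; it assumes the eigenvalues are comfortably far from the test line, as shattering guarantees.

```python
import numpy as np

def count_right_of(A, h, iters=25):
    """Number of eigenvalues of A with real part > h, via Tr sgn(A - h).

    Tr sgn(A - h) = (# eigenvalues right of the line) - (# left of it),
    so rounding the computed trace to the nearest integer recovers the
    counts exactly once sgn(A - h) is known to accuracy O(1/n).
    """
    n = A.shape[0]
    X = np.array(A, dtype=np.complex128) - h * np.eye(n)
    for _ in range(iters):                 # Newton stand-in for SGN
        X = 0.5 * (X + np.linalg.inv(X))
    diff = int(round(np.trace(X).real))    # (# right) minus (# left)
    return (n + diff) // 2

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8)) / np.sqrt(8)
# The two printed counts should agree.
print(count_right_of(A, 0.0), np.sum(np.linalg.eigvals(A).real > 0))
```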

[figure c: pseudocode for \(\mathsf {SPLIT}\)]

Theorem 5.2

(Guarantees for \(\mathsf {SPLIT}\)) Assume \(\mathsf {INV}\) is a \((\mu _\mathsf {INV},c_\mathsf {INV})\)-stable matrix inversion algorithm satisfying Definition 2.7. Let \(\epsilon \le 0.5\), \(\beta \le 0.05/n\), and \(\Vert A\Vert \le 4\), let \(\mathsf {g}\) have side lengths of at most 8, and define

$$\begin{aligned} N_{\mathsf {SPLIT}} := \lg \frac{256}{\epsilon } + 3\lg \lg \frac{256}{\epsilon } + \lg \lg \frac{4}{\beta \epsilon } + 7.59. \end{aligned}$$

Then \(\mathsf {SPLIT}\) has the advertised guarantees when run on a floating point machine with precision

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_\mathsf {SPLIT}:= \min \left\{ \frac{\left( 1 - \frac{\epsilon }{256}\right) ^{2^{N_{\mathsf {SPLIT}}+1} (c_{\mathsf {INV}} \log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {SPLIT}}},\, \frac{\epsilon }{100 n},\, \frac{\epsilon ^2}{512} \right\} , \end{aligned}$$

using at most

$$\begin{aligned} T_{\mathsf {SPLIT}}(n,\mathsf {g},\epsilon ,\beta ) \le 12 \lg \frac{1}{\omega (\mathsf {g})}\cdot N_{\mathsf {SPLIT}} \cdot \left( T_{\mathsf {INV}}(n) + O(n^2) \right) \end{aligned}$$

arithmetic operations. The number of bits required is

$$\begin{aligned} \lg 1/{\textbf {u }}_{\mathsf {SPLIT}} = O\left( \log n \log ^3 \frac{256}{\epsilon }\left( \log \frac{1}{\beta } + \log \frac{4}{\epsilon }\right) \right) . \end{aligned}$$

Deflation of the approximate projectors we obtain from \(\mathsf {SPLIT}\) amounts to a standard rank-revealing QR factorization. This can be achieved deterministically in \(O(n^3)\) time with the classic algorithm of Gu and Eisenstat [36], or probabilistically in matrix-multiplication time with a variant of the method of [22]; we will use the latter.
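As a minimal illustration of the randomized approach (ours; the actual \(\mathsf {RURV}\)-based routine and its finite-arithmetic analysis appear in “Appendix C”): right-multiplying the approximate projector by a Haar unitary makes an unpivoted QR factorization rank-revealing with high probability.

```python
import numpy as np

def deflate(P, k, rng):
    """Orthonormal basis for the range of an (approximate) rank-k projector.

    Right-multiplying by a Haar unitary makes the leading k columns of
    P @ V generically well-conditioned inside range(P), so an unpivoted
    QR reveals the rank with high probability.
    """
    n = P.shape[0]
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    V, _ = np.linalg.qr(G)      # Haar-distributed unitary (up to column phases)
    Q, _ = np.linalg.qr(P @ V)
    return Q[:, :k]

# Check on an exact (generally non-orthogonal) rank-2 spectral-style projector.
rng = np.random.default_rng(3)
W = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
P = W[:, :2] @ np.linalg.inv(W)[:2, :]     # satisfies P @ P = P, rank 2
Q = deflate(P, 2, rng)
print(np.linalg.norm(P - Q @ (Q.conj().T @ P)))  # ~ machine precision
```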

[figure d: pseudocode for \(\mathsf {DEFLATE}\)]

Theorem 5.3

(Guarantees for \(\mathsf {DEFLATE}\)) Assume \(\mathsf {MM}\) and \(\mathsf {QR}\) are matrix multiplication and QR factorization algorithms satisfying Definitions 2.6 and 2.8. Then \(\mathsf {DEFLATE}\) has the advertised guarantees when run on a machine with precision:

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_\mathsf {DEFLATE}:= \min \left\{ \frac{\beta }{ 4\Vert {\widetilde{P}}\Vert \max (\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n))}, \frac{\eta }{2\mu _{\mathsf {QR}}(n)}\right\} . \end{aligned}$$

The number of arithmetic operations is at most:

$$\begin{aligned} T_{\mathsf {DEFLATE}}(n) = n^2 T_\mathsf {N}+ 2T_\mathsf {QR}(n)+T_\mathsf {MM}(n). \end{aligned}$$

Remark 5.4

The proof of the above theorem, which is deferred to “Appendix C”, closely follows and builds on the analysis of the randomized rank revealing factorization algorithm (\(\mathsf {RURV}\)) introduced in [22] and further studied in [7]. The parameters in the theorem are optimized for the particular application of finding a basis for a deflating subspace given an approximate spectral projector.

The main difference with the analysis in [22] and [7] is that here, to make it applicable to complex matrices, we make use of Haar unitary random matrices instead of Haar orthogonal random matrices. In our analysis of the unitary case, we discovered a strikingly simple formula (Corollary C.6) for the density of the smallest singular value of an \(r\times r\) sub-matrix of an \(n\times n\) Haar unitary; this formula is leveraged to obtain guarantees that work for any n and r, and not only for when \(n-r \ge 30\), as was the case in [7]. Finally, we explicitly account for finite arithmetic considerations in the Gaussian randomness used in the algorithm, since true Haar unitary matrices can never be produced in finite arithmetic.

We are ready now to state completely an algorithm \(\mathsf {EIG}\) which accepts a shattered matrix and grid and outputs approximate eigenvectors and eigenvalues with a forward-error guarantee. Aside from the a priori unmotivated parameter settings in lines 2 and 3—which we promise to justify in the analysis to come—\(\mathsf {EIG}\) implements an approximate version of the split and deflate framework that began this section.

[figure e: pseudocode for \(\mathsf {EIG}\)]

Theorem 5.5

(\(\mathsf {EIG}\): Finite Arithmetic Guarantee) Assume \(\mathsf {MM}, \mathsf {QR}\), and \(\mathsf {INV}\) are numerically stable algorithms for matrix multiplication, QR factorization, and inversion satisfying Definitions 2.6, 2.8, and 2.7. Let \(\delta < 1\), \(A \in \mathbb {C}^{n\times n}\) have \(\Vert A\Vert \le 3.5\) and, for some \(\epsilon < 1/2\), have \(\epsilon \)-pseudospectrum shattered with respect to a grid \(\mathsf {g}= \mathsf {grid}(z_0,\omega ,s_1,s_2)\) with side lengths at most 8 and \(\omega \le 1\). Define

$$\begin{aligned} N_{\mathsf {EIG}} := \lg \frac{256 n}{\epsilon } + 3\lg \lg \frac{256 n}{\epsilon } + \lg \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59. \end{aligned}$$

Then \(\mathsf {EIG}\) has the advertised guarantees when run on a floating point machine with precision satisfying:

$$\begin{aligned} \lg 1/{\textbf {u }}&\ge \max \left\{ \lg ^3 \frac{ n}{\epsilon }\lg \left( \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8} \right) 2^{9.59}(c_{\mathsf {INV}}\log n + 3) + \lg N_{\mathsf {EIG}}, \lg \frac{(5n)^{30}}{\theta ^2\delta ^4\epsilon ^8} \right. \\&\quad \left. + \lg \max \{\mu _{\mathsf {MM}}(n),\mu _{\mathsf {QR}}(n),n\} \right\} \\&= O\left( \log ^3 \frac{n}{\epsilon }\log \frac{n}{\theta \delta \epsilon }\log n\right) . \end{aligned}$$

The number of arithmetic operations is at most

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta , n)&= 60 N_{\mathsf {EIG}}\lg \frac{1}{\omega (\mathsf {g})}\left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 10 T_{\mathsf {QR}}(n) + 25 T_{\mathsf {MM}}(n) \\&= O\left( \log \frac{1}{\omega (\mathsf {g})}\left( \log \frac{n}{\epsilon } + \log \log \frac{1}{\theta \delta }\right) T_{\mathsf {MM}}(n) \right) . \end{aligned}$$

Remark 5.6

We have not fully optimized the large constant \(2^{9.59}\) appearing in the bit length above.

Theorem 5.5 easily implies Theorem 1.6 when combined with \(\mathsf {SHATTER}\).

Theorem 5.7

(Restatement of Theorem 1.6) There is a randomized algorithm \(\mathsf {EIG}\) which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and a desired accuracy parameter \(\delta \in (0,1)\) outputs a diagonal D and invertible V such that

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \delta \quad \mathrm {and}\quad \kappa (V) \le 32n^{2.5}/\delta \end{aligned}$$

in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O\left( \log ^4\frac{n}{\delta }\log n\right) \end{aligned}$$

bits of precision, with probability at least \(1-14/n\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Proof

Given A and \(\delta \), consider the following two step algorithm:

1. \((X, \mathsf {g}, \epsilon )\leftarrow \mathsf {SHATTER}(A,\delta /8)\).

2. \((V,D)\leftarrow \mathsf {EIG}(X,\delta ',\mathsf {g},\epsilon ,1/n,n)\), where

    $$\begin{aligned} \delta ' := \frac{\delta ^3}{n^{4.5} \cdot 6 \cdot 128 \cdot 2}. \end{aligned}$$
    (42)

With probability at least \(1 - 13/n\), \(\mathsf {SHATTER}(A,\delta /8)\) succeeds, in which case the outputs \((X,\mathsf {g},\epsilon )\) easily satisfy the assumptions in Theorem 5.5: \(\delta ' \le \delta < 1\), \(\epsilon = \tfrac{(\delta /8)^5}{32 n^9} \le 1/2\), \(\mathsf {g}\) is defined by \(\mathsf {SHATTER}\) to have side length 8, \(\Vert X\Vert \le \Vert A\Vert + \Vert X - A\Vert \le 1 + 4(\delta /8) \le 3.5\), and X has \(\epsilon \)-pseudospectrum shattered with respect to \(\mathsf {g}\). On this event, \(X = WCW^{-1}\), and (using the proof of Theorem 3.6) if we normalize W to have unit length columns, then \(\kappa (W) = \Vert W\Vert \Vert W^{-1}\Vert \le 8n^2/\delta \).

We will show that the choice of \(\delta '\) in (42) guarantees

$$\begin{aligned} \Vert X-VDV^{-1}\Vert \le \delta /2. \end{aligned}$$

Since \(\Vert X\Vert \le \Vert A\Vert + \Vert A - X\Vert \le 1 + 4\gamma \le 3\) with \(\gamma =\delta /8\) from Theorem 3.13, the hypotheses of Theorem 5.5 are satisfied. Thus \(\mathsf {EIG}\) succeeds with probability at least \(1-1/n\), and by a union bound, both \(\mathsf {EIG}\) and \(\mathsf {SHATTER}\) succeed with probability at least \(1 - 14/n\). On this event, we have \(V=W+E\) for some \(\Vert E\Vert \le \delta '\sqrt{n}\), so

$$\begin{aligned} \Vert V-W\Vert \le \delta '\sqrt{n}, \end{aligned}$$

as well as

$$\begin{aligned} \sigma _n(V)\ge \sigma _n(W)-\Vert E\Vert \ge \frac{\delta }{8n^2}-\delta '\sqrt{n}\ge \frac{\delta }{16n^2}, \end{aligned}$$

since our choice of \(\delta '\) satisfies the much cruder bound of

$$\begin{aligned} \delta '\le \frac{\delta }{16n^{2.5}}. \end{aligned}$$

This implies that

$$\begin{aligned} \kappa (V)=\Vert V\Vert \Vert V^{-1}\Vert \le 2\sqrt{n}\cdot \frac{16n^2}{\delta }, \end{aligned}$$

establishing the last item of the theorem. We can control the perturbation of the inverse as:

$$\begin{aligned} \Vert V^{-1} - W^{-1} \Vert&= \Vert W^{-1} (W - V) V^{-1} \Vert \\&\le \kappa (W)\Vert W - V\Vert \Vert V^{-1}\Vert \\&\le \frac{8n^2}{\delta }\cdot \delta '\sqrt{n} \cdot \frac{16 n^2}{\delta }\\&\le \frac{128 n^{4.5}\delta '}{\delta ^2}. \end{aligned}$$

The grid output by \(\mathsf {SHATTER}(A,\delta /8)\) has \(\omega = \tfrac{\delta ^4}{4\cdot 8^4\cdot n^5} \le \tfrac{\delta }{\sqrt{2}}\) provided \(\delta < 1\). Thus the guarantees on \(\mathsf {EIG}\) in Theorem 5.5 tell us each eigenvalue of \(X = WCW^{-1}\) shares a grid square with exactly one diagonal entry of D, which means that \(\Vert C - D\Vert \le \sqrt{2}\omega \le \delta \). So, we have:

$$\begin{aligned} \Vert VDV^{-1}-WCW^{-1}\Vert&\le \Vert (V-W) D V^{-1} \Vert + \Vert W (D-C) V^{-1} \Vert \\&\quad + \Vert WC (V^{-1} - W^{-1}) \Vert \\&\le \delta ' \sqrt{n} \cdot 5 \cdot \frac{16n^2}{\delta } + \sqrt{n} \delta ' \frac{16 n^2}{\delta } + \sqrt{n} \cdot 5 \cdot \frac{128n^{4.5} \delta '}{\delta ^2} \\&= \frac{\delta ' n^{4.5}}{\delta } \left( 5 \cdot 16 + 16 + \frac{5 \cdot 128}{\delta } \right) \\&\le \frac{\delta ' n^{4.5}}{\delta ^2} \cdot 6\cdot 128 \end{aligned}$$

which is at most \(\delta /2\), for \(\delta '\) chosen as above. We conclude that

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \Vert A-X\Vert +\Vert X-VDV^{-1}\Vert \le \delta , \end{aligned}$$

with probability \(1-14/n\) as desired.

To compute the running time and precision, we observe that \(\mathsf {SHATTER}\) outputs a grid with parameters

$$\begin{aligned} \omega = \Omega \left( \frac{\delta ^4}{n^5}\right) ,\quad \epsilon =\Omega \left( \frac{\delta ^5}{n^9}\right) . \end{aligned}$$

Plugging this into the guarantees of \(\mathsf {EIG}\), we see that it takes

$$\begin{aligned} O\left( \log \frac{n}{\delta }\left( \log \frac{n}{\delta } + \log \log \frac{n}{\delta }\right) T_{\mathsf {MM}}(n) \right) = O(T_\mathsf {MM}(n)\log ^2(n/\delta )) \end{aligned}$$

arithmetic operations, on a floating point machine with precision

$$\begin{aligned} O\left( \log ^3 \frac{n}{\delta }\log \frac{n}{\delta }\log n\right) = O(\log ^4(n/\delta )\log (n)) \end{aligned}$$

bits, as advertised. \(\square \)
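The two-step structure of this proof is easy to simulate with off-the-shelf tools. The sketch below (ours) uses a small complex Ginibre perturbation in the role of \(\mathsf {SHATTER}\) and numpy's eigensolver in the role of \(\mathsf {EIG}\), then measures the backward error \(\Vert A-VDV^{-1}\Vert \) and \(\kappa (V)\) controlled by Theorem 5.7; the scalings here are illustrative rather than the theorem's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta = 20, 1e-3

# A maximally non-normal input: a Jordan block (one defective eigenvalue).
A = np.diag(np.ones(n - 1), 1)

# Role of SHATTER: add gamma * (complex Ginibre), with gamma = delta / 8.
gamma = delta / 8
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G

# Role of EIG: diagonalize the regularized matrix.
evals, V = np.linalg.eig(X)
err = np.linalg.norm(A - V @ np.diag(evals) @ np.linalg.inv(V), 2)
print(err <= delta, np.linalg.cond(V))  # expect: True, and a modest kappa(V)
```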

5.1 Proof of Theorem 5.5

A key stepping-stone in our proof will be the following elementary result controlling the spectrum, pseudospectrum, and eigenvectors after perturbing a shattered matrix.

Lemma 5.8

(Eigenvector Perturbation for a Shattered Matrix) Let \(\Lambda _{\epsilon }(A)\) be shattered with respect to a grid whose squares have side length \(\omega \), and assume that \(\Vert {{\widetilde{A}}} - A\Vert \le \eta < \epsilon \). Then, (i) each eigenvalue of \({{\widetilde{A}}}\) lies in the same grid square as exactly one eigenvalue of A, (ii) \(\Lambda _{\epsilon - \eta }(\widetilde{A})\) is shattered with respect to the same grid, and (iii) for any right unit eigenvector \({{\widetilde{v}}}\) of \({\widetilde{A}}\), there exists a right unit eigenvector v of A corresponding to the same grid square, and for which

$$\begin{aligned} \Vert {{\widetilde{v}}} - v\Vert \le \frac{\sqrt{8}\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Proof

For (i), consider \(A_t = A + t({{\widetilde{A}}} - A)\) for \(t \in [0,1]\). By continuity, the entire trajectory of each eigenvalue is contained in a unique connected component of \(\Lambda _\eta (A) \subset \Lambda _\epsilon (A)\). For (ii), \(\Lambda _{\epsilon - \eta }({{\widetilde{A}}}) \subset \Lambda _{\epsilon }(A)\), which is shattered by hypothesis. Finally, for (iii), let \(w^*\) and \({\widetilde{w}}^*\) be the corresponding left eigenvectors to v and \({\widetilde{v}}\), respectively, normalized so that \(w^*v = {\widetilde{w}}^*{\widetilde{v}} = 1\). Let \(\Gamma \) be the boundary of the grid square containing the eigenvalues associated to v and \({\widetilde{v}}\), respectively. Then, using a contour integral along \(\Gamma \) as in (13) above, one gets

$$\begin{aligned} \Vert {\widetilde{v}}{\widetilde{w}}^*- vw^*\Vert \le \frac{2\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Thus, using that \(\Vert v\Vert =1\) and \(w^*v = 1\),

$$\begin{aligned} \Vert {\widetilde{v}}{\widetilde{w}}^*-vw^*\Vert \ge \Vert ({\widetilde{v}}{\widetilde{w}}^*-vw^*) v\Vert = \Vert ({\widetilde{w}}^* v) {\widetilde{v}}-v\Vert . \end{aligned}$$

Now, since \(({\widetilde{v}}^*v) {\widetilde{v}}\) is the orthogonal projection of v onto the span of \({\widetilde{v}}\), we have that

$$\begin{aligned} \Vert ({\widetilde{w}}^*v){\widetilde{v}}- v\Vert \ge \Vert ({\widetilde{v}}^*v) {\widetilde{v}}- v\Vert = \sqrt{1-|{\widetilde{v}}^*v|^2}. \end{aligned}$$

Multiplying v by a phase we can assume without loss of generality that \({\widetilde{v}}^* v\ge 0\) which implies that

$$\begin{aligned} \sqrt{1-({\widetilde{v}}^*v)^2} = \sqrt{(1-{\widetilde{v}}^*v)(1+{\widetilde{v}}^*v)} \ge \sqrt{1-{\widetilde{v}}^*v}. \end{aligned}$$

The above discussion can now be summarized in the following chain of inequalities

$$\begin{aligned} \sqrt{1-{\widetilde{v}}^*v}\le \sqrt{1-({\widetilde{v}}^*v)^2}\le \Vert ({\widetilde{w}}^*v){\widetilde{v}}- v\Vert \le \Vert {\widetilde{v}}{\widetilde{w}}^*-vw^*\Vert \le \frac{2\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Finally, note that \(\Vert v-{\widetilde{v}}\Vert = \sqrt{2-2{\widetilde{v}}^*v} \le \frac{\sqrt{8}\omega }{\pi } \frac{\eta }{\epsilon (\epsilon - \eta )}\) as we wanted to show. \(\square \)

The algorithm \(\mathsf {EIG}\) works by recursively reducing to subinstances of smaller size, but requires a pseudospectral guarantee to ensure speed and stability. We thus need to verify that the pseudospectrum does not deteriorate too substantially when we pass to a sub-problem.

Lemma 5.9

(Shattering is preserved after compression) Suppose P is a spectral projector of \(A\in \mathbb {C}^{n\times n}\) of rank k. Let \(Q\in \mathbb {C}^{n\times k}\) be such that \(Q^*Q=I_k\) and that its columns span the same space as the columns of P. Then for every \(\epsilon >0\),

$$\begin{aligned} \Lambda _\epsilon (Q^*AQ)\subset \Lambda _\epsilon (A). \end{aligned}$$

Alternatively, the same pseudospectral inclusion holds if again \(Q^*Q=I_k\) and, instead, the columns of Q span the same space as the rows of P.

Proof

We will first analyze the case when the columns of Q span the same space as the columns of P. To begin, note that if \(z\in \Lambda _\epsilon (Q^*AQ)\) then there exists \(v\in \mathbb {C}^k\) satisfying \(\Vert (z-Q^*AQ)v\Vert \le \epsilon \Vert v\Vert \). Since \(I_k=Q^*I_nQ\) we have

$$\begin{aligned} \Vert Q^*(z-A)Qv\Vert \le \epsilon \Vert v\Vert . \end{aligned}$$

And, because \(Q^*\) acts as an isometry on \(\mathrm {range}(Q)\) (the span of the columns of Q), and this space, being the range of the spectral projector P, is invariant under A (and hence under \((z-A)\)), we have that \((z-A)Qv\in \mathrm {range}(Q)\), and therefore \(\Vert Q^*(z-A)Qv\Vert = \Vert (z-A)Qv\Vert \). From this we obtain

$$\begin{aligned} \Vert (z-A)Qv\Vert \le \epsilon \Vert v\Vert =\epsilon \Vert Qv\Vert , \end{aligned}$$

showing that \(z\in \Lambda _\epsilon (A)\).

For the case in which the columns of Q span the rows of P, the above proof can be easily modified by now taking v with the property that \(\Vert v^* Q^* (z-A)Q\Vert \le \epsilon \Vert v\Vert \). \(\square \)
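Numerically, the inclusion of Lemma 5.9 is the pointwise inequality \(\sigma _{\min }(z-A)\le \sigma _{\min }(z-Q^*AQ)\) for all \(z\in \mathbb {C}\). The sketch below (ours) checks this at random sample points, with Q spanning an exact invariant subspace.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

evals, W = np.linalg.eig(A)
Q, _ = np.linalg.qr(W[:, :k])   # orthonormal basis for an invariant subspace
B = Q.conj().T @ A @ Q          # the compression Q*AQ

smin = lambda M: np.linalg.svd(M, compute_uv=False)[-1]
zs = 2 * (rng.standard_normal(20) + 1j * rng.standard_normal(20))
for z in zs:
    # Lambda_eps(Q*AQ) inside Lambda_eps(A)  <=>  smin(z-A) <= smin(z-B).
    assert smin(z * np.eye(n) - A) <= smin(z * np.eye(k) - B) + 1e-9
print("pseudospectral inclusion holds at all sampled points")
```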

Observation 5.10

Since \(\delta ,\omega (\mathsf {g}),\epsilon \le 1\), our assumption on \(\eta \) in Line 2 of the pseudocode of \(\mathsf {EIG}\) implies the following bounds on \(\eta \) which we will use below:

$$\begin{aligned} \eta \le \min \left\{ 0.02,\epsilon /75,\delta /100,\frac{\delta \epsilon ^2}{200\omega (\mathsf {g})}\right\} . \end{aligned}$$

Initial lemmas in hand, let us begin to analyze the algorithm. At several points we will make an assumption on the machine precision, recorded in the margin next to the corresponding inequality. These will be collected at the end of the proof, where we will verify that they follow from the precision hypothesis of Theorem 5.5.

Correctness.

Lemma 5.11

(Accuracy of \(\widetilde{\lambda _i}\)) When \(\mathsf {DEFLATE}\) succeeds, each eigenvalue of A shares a square of \(\mathsf {g}\) with a unique eigenvalue of either \(\widetilde{A_{+}}\) or \(\widetilde{A_{-}}\), and furthermore \(\Lambda _{4\epsilon /5} (\widetilde{A_{\pm }}) \subset \Lambda _\epsilon (A)\).

Proof

Let \(P_{\pm }\) be the true projectors onto the two bisection regions found by \(\mathsf {SPLIT}(A,\beta )\), \(Q_{\pm }\) be the matrices whose orthonormal columns span their ranges, and \(A_{\pm } := Q_{\pm }^*A Q_{\pm }\). From Theorem 5.3, on the event that \(\mathsf {DEFLATE}\) succeeds, the approximation \(\widetilde{Q_{\pm }}\) that it outputs satisfies \(\Vert \widetilde{Q_{\pm }} - Q_{\pm }\Vert \le \eta \), so in particular \(\Vert \widetilde{Q_{\pm }}\Vert \le 2\) as \(\eta \le 1\). The error \(E_{6,\pm }\) from performing the matrix multiplications necessary to compute \(\widetilde{A_{\pm }}\) admits the bound

$$\begin{aligned} \Vert E_{6,\pm }\Vert&\le \mu _{\mathsf {MM}}(n)\Vert \widetilde{Q_{\pm }}\Vert \Vert A\widetilde{Q_{\pm }}\Vert {\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2\Vert \widetilde{Q_{\pm }}A\Vert {\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2\Vert \widetilde{Q_{\pm }}\Vert ^2\Vert A\Vert {\textbf {u }}\\&\le 16\left( \mu _{\mathsf {MM}}(n){\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2{\textbf {u }}^2\right) \quad \quad \quad \Vert A\Vert \le 4 \text { and } \Vert \widetilde{Q_{\pm }}\Vert \le 1 + \eta \le 1.02 \le \sqrt{2} \\&\le 3\eta \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad {\textbf {u }}\le \frac{\eta }{10 \mu _{\mathsf {MM}}(n)^2}. \end{aligned}$$

Iterating the triangle inequality, we obtain

$$\begin{aligned} \Vert \widetilde{A_{\pm }} - A_{\pm }\Vert&\le \Vert E_{6,\pm }\Vert + \Vert (\widetilde{Q_{\pm }} - Q_{\pm })A\widetilde{Q_{\pm }}\Vert + \Vert Q_{\pm }A(\widetilde{Q_{\pm }} - Q_{\pm })\Vert \\&\le 3\eta + 8\eta + 4\eta \quad \quad \quad \Vert \widetilde{Q_{\pm }} - Q_{\pm }\Vert \le \eta \\&\le \epsilon /5 \quad \quad \quad \quad \quad \quad \quad \eta \le \epsilon /75. \end{aligned}$$

We can now apply Lemma 5.8. \(\square \)

Everything is now in place to show that, if every call to \(\mathsf {DEFLATE}\) succeeds, \(\mathsf {EIG}\) has the advertised accuracy guarantees. After we show this, we will lower bound this success probability and compute the running time.

When \(A \in \mathbb {C}^{1\times 1}\), the algorithm works as promised. Assume inductively that \(\mathsf {EIG}\) has the desired guarantees on instances of size strictly smaller than n. In particular, maintaining the notation from the above lemmas, we may assume that

$$\begin{aligned} ({\widetilde{V}}_{\pm }, \widetilde{D_{\pm }}) = \mathsf {EIG}(\widetilde{A_{\pm }},4\delta /5,\mathsf {g}_{\pm },4\epsilon /5,\theta ,n) \end{aligned}$$

satisfy (i) each eigenvalue of \(\widetilde{D_{\pm }}\) shares a square of \(\mathsf {g}_{\pm }\) with exactly one eigenvalue of \(\widetilde{A_{\pm }}\), and (ii) each column of \(\widetilde{V_{\pm }}\) is \(4\delta /5\)-close to a true eigenvector of \(\widetilde{A_{\pm }}\). From Lemma 5.8, each eigenvalue of \(\widetilde{A_{\pm }}\) shares a grid square with exactly one eigenvalue of A, and thus the output

$$\begin{aligned} {\widetilde{D}} = \begin{pmatrix} \widetilde{D_+} &{} \\ &{} \widetilde{D_-} \end{pmatrix} \end{aligned}$$

satisfies the eigenvalue guarantee.

To verify that the computed eigenvectors are close to the true ones, let \(\widetilde{{\widetilde{v}}_{\pm }}\) be some approximate right unit eigenvector of one of \(\widetilde{A_{\pm }}\) output by \(\mathsf {EIG}\) (with norm \(1 \pm n{\textbf {u }}\)), \({\widetilde{v}}_{\pm }\) the exact unit eigenvector of \(\widetilde{A_\pm }\) that it approximates, and \(v_{\pm }\) the corresponding exact unit eigenvector of \(A_{\pm }\). Recursively, \(\mathsf {EIG}(A,\epsilon ,\mathsf {g},\delta ,\theta ,n)\) will output an approximate unit eigenvector

$$\begin{aligned} {\widetilde{v}} := \frac{\widetilde{Q_{\pm }} \widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }} \widetilde{{\widetilde{v}}_{\pm }} + e\Vert } + e', \end{aligned}$$

whose proximity to the actual eigenvector \(v := Q v_{\pm }\) we need now to quantify. The error terms here are e, a column of the error matrix \(E_{8}\) whose norm we can crudely bound by

$$\begin{aligned} \Vert e\Vert \le \Vert E_{8}\Vert \le \mu _{\mathsf {MM}}(n) \Vert \widetilde{Q_{\pm }}\Vert \Vert \widetilde{V_{\pm }}\Vert {\textbf {u }}\le 4\mu _{\mathsf {MM}}(n){\textbf {u }}\le \eta , \end{aligned}$$

and \(e'\), a column of \(E_9\), incurred by performing the normalization in floating point; in our initial discussion of floating point arithmetic we assumed in (16) that \(\Vert e'\Vert \le n{\textbf {u }}\).

First, since \({\widetilde{v}} - e'\) and \(\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\) are parallel, the distance between them is just the difference in their norms:

$$\begin{aligned} \left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } - \left( \widetilde{Q_\pm }\widetilde{{\widetilde{v}}_{\pm }} + e\right) \right\|&\le \left| \Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert - 1\right| \le (1 + \eta )(1 + {\textbf {u }}) \\&\quad +\, 4\mu _{\mathsf {MM}}{\textbf {u }}- 1 \le 4\eta . \end{aligned}$$

Inductively \(\Vert \widetilde{{\widetilde{v}}_{\pm }} - {\widetilde{v}}_{\pm } \Vert \le 4\delta /5\), and since \(\Vert A_{\pm } - \widetilde{A_{\pm }}\Vert \le \epsilon /5\) and \(A_{\pm }\) has shattered \(\epsilon \)-pseudospectrum from Lemma 5.9, Lemma 5.8 ensures

$$\begin{aligned} \Vert {\widetilde{v}}_{\pm } - v_{\pm }\Vert&\le \frac{\sqrt{8}\omega (\mathsf {g}) \cdot 15\eta }{\pi \cdot \epsilon (\epsilon - 15\eta )} \\&\le \frac{\sqrt{8}\omega (\mathsf {g})\cdot 15\eta }{\pi \cdot 4\epsilon ^2/5}&\eta \le \epsilon /75 \\&\le \delta /10&\eta \le \frac{\delta \epsilon ^2}{200 \omega (\mathsf {g})}. \end{aligned}$$

Thus, putting together the above, iterating the triangle inequality, and using \(\Vert Q_{\pm }\Vert = 1\),

$$\begin{aligned} \left\| {\widetilde{v}} - v \right\|&=\left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } + e' - Q_{\pm }v_{\pm } \right\| \\&\le \left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } - \left( \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\right) \right\| + \Vert e'\Vert + \Vert e\Vert + \Vert (\widetilde{Q_{\pm }} - Q_{\pm })\widetilde{{\widetilde{v}}_{\pm }}\Vert \\&\quad + \Vert Q_{\pm }(\widetilde{{\widetilde{v}}_{\pm }} - {\widetilde{v}}_{\pm })\Vert + \Vert Q_{\pm }({\widetilde{v}}_{\pm } - v_{\pm })\Vert \\&\le 4\eta + n{\textbf {u }}+ \mu _{\mathsf {MM}}(n){\textbf {u }}+ \eta (1 + n{\textbf {u }}) + 4\delta /5 + \delta /10&n{\textbf {u }},\ \mu _{\mathsf {MM}}(n) {\textbf {u }}\le \eta \\&\le 8\eta + 4\delta /5 + \delta /10 \le \delta&\eta \le \delta /200. \end{aligned}$$

This concludes the proof of correctness of \(\mathsf {EIG}\).

Running Time and Failure Probability. Let us begin with a simple lemma bounding the depth of \(\mathsf {EIG}\)’s recursion tree.

Lemma 5.12

(Recursion Depth) The recursion tree of \(\mathsf {EIG}\) has depth at most \(\log _{5/4} n\), and every branch ends with an instance of size \(1\times 1\).

Proof

By Theorem 5.2, \(\mathsf {SPLIT}\) can always find a bisection of the spectrum into two regions containing \(n_\pm \) eigenvalues, respectively, with \(n_+ + n_- = n\) and \(n_{\pm } \le 4n/5\), and when \(n\le 5\) can always peel off at least one eigenvalue. Thus the depth d(n) satisfies

$$\begin{aligned} d(n) = {\left\{ \begin{array}{ll} n &{} n\le 5 \\ 1 + \max _{\theta \in [1/5,4/5]} d(\theta n) &{} n > 5 \end{array}\right. } \end{aligned}$$
(43)

As \(n \le \log _{5/4}n\) for \(n \le 5\), the result is immediate from induction. \(\square \)
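The recurrence can also be unrolled mechanically; the snippet below (ours) takes the worst allowed split \(n_+ = \lfloor 4n/5 \rfloor \) at every level and compares against \(\log _{5/4} n\).

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def depth(n):
    # Recurrence (43); with integer part sizes, floor(4n/5) is the worst split.
    return n if n <= 5 else 1 + depth(4 * n // 5)

for n in (10, 100, 10_000):
    print(n, depth(n), round(math.log(n, 5 / 4), 1))  # depth <= log_{5/4} n
```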

We pause briefly to verify that the assumptions \(\delta < 1\), \(\epsilon < 1/2\), \(\mathsf {g}\) has side lengths at most 8, and \(\Vert A\Vert \le 3.5\) in Theorem 5.5 ensure that every call to \(\mathsf {SPLIT}\) throughout the algorithm satisfies the hypotheses of Theorem 5.2, namely that \(\epsilon \le 0.5, \beta \le 0.05/n, \Vert A\Vert \le 4,\) and \(\mathsf {g}\) has side lengths of at most 8. Since \(\delta ,\epsilon ,\) and \(\beta \) are non-increasing as we travel down the recursion tree of \(\mathsf {EIG}\)—with \(\beta \) monotonically decreasing in \(\delta \) and \(\epsilon \)—we need only verify that the hypotheses of Theorem 5.2 hold on the initial call to \(\mathsf {EIG}\). The condition on \(\epsilon \) is immediately satisfied; for the one on \(\beta \), we have

$$\begin{aligned} \beta = \frac{\eta ^4 \theta ^2}{(20 n)^6 \cdot 4 n^8} = \frac{\theta ^2\delta ^4 \epsilon ^8}{200^4 (20 n)^6 \cdot 4 n^8}, \end{aligned}$$

which is clearly at most 0.05/n.

On each new call to \(\mathsf {EIG}\) the grid only decreases in size, so the initial assumption is sufficient. Finally, we need that every matrix passed to \(\mathsf {SPLIT}\) throughout the course of the algorithm has norm at most 4. Lemma 5.11 shows that if \(\Vert A\Vert \le 4\) and the \(\epsilon \)-pseudospectrum of A is shattered, then \(\Vert \widetilde{A_{\pm }} - A_{\pm }\Vert \le \epsilon /5\), and since \(\Vert A_{\pm }\Vert \le \Vert A\Vert \), this means \(\Vert \widetilde{A_{\pm }}\Vert \le \Vert A\Vert + \epsilon /5\). Thus each time we pass to a subproblem, the norm of the matrix we pass to \(\mathsf {EIG}\) (and thus to \(\mathsf {SPLIT}\)) increases by at most an additive \(\epsilon /5\), where \(\epsilon \) is the input to the outermost call to \(\mathsf {EIG}\). Since \(\epsilon \) decreases by a factor of 4/5 on each recursion step, this means that by the end of the algorithm the norm of the matrix passed to \(\mathsf {EIG}\) will increase by at most an additive \((\epsilon + (4/5)\epsilon + (4/5)^2 \epsilon + \cdots )/5 = \epsilon \le 1/2\). Thus we will be safe if our initial matrix has norm at most 3.5, as assumed.

Lemma 5.13

(Lower Bounds on the Parameters) Assume \(\mathsf {EIG}\) is run on an \(n\times n\) matrix, with some parameters \(\delta \) and \(\epsilon \). Throughout the algorithm, on every recursive call to \(\mathsf {EIG}\), the corresponding parameters \(\delta '\) and \(\epsilon '\) satisfy

$$\begin{aligned} \delta ' \ge \delta /n \qquad \epsilon ' \ge \epsilon /n. \end{aligned}$$

On each such call to \(\mathsf {EIG}\), the parameters \(\eta '\) and \(\beta '\) passed to \(\mathsf {SPLIT}\) and \(\mathsf {DEFLATE}\) satisfy

$$\begin{aligned} \eta ' \ge \frac{\delta \epsilon ^2}{200n^3} \qquad \beta ' \ge \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{26}}. \end{aligned}$$

Proof

Along each branch of the recursion tree, we replace \(\epsilon \leftarrow 4\epsilon /5\) and \(\delta \leftarrow 4\delta /5\) at most \(\log _{5/4}n\) times, so each can only decrease by a factor of n from their initial settings. The parameters \(\eta '\) and \(\beta '\) are computed directly from \(\epsilon '\) and \(\delta '\). \(\square \)

Lemma 5.14

(Failure Probability) \(\mathsf {EIG}\) fails with probability no more than \(\theta \).

Proof

Since each recursion splits into at most two subproblems, and the recursion tree has depth \(\log _{5/4}n\), there are at most

$$\begin{aligned} 2\cdot 2^{\log _{5/4}n} = 2n^{\frac{\log 2}{\log 5/4}} \le 2n^4 \end{aligned}$$

calls to \(\mathsf {DEFLATE}\). We have set every \(\eta \) and \(\beta \) so that the failure probability of each is \(\theta /2n^4\), so a crude union bound finishes the proof. \(\square \)

The arithmetic operations required for \(\mathsf {EIG}\) satisfy the recursive relationship

$$\begin{aligned}&T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n) \le T_{\mathsf {SPLIT}}(n,\epsilon ,\beta ) + T_{\mathsf {DEFLATE}}(n,\beta ,\eta ) + 2 T_{\mathsf {MM}}(n) \\&\qquad + T_{\mathsf {EIG}}(n_+,4\delta /5,\mathsf {g}_{+},4\epsilon /5,\theta ,n) + T_{\mathsf {EIG}}(n_-,4\delta /5,\mathsf {g}_{-},4\epsilon /5,\theta ,n) \\&\qquad + 2T_{\mathsf {MM}}(n) + O(n^2). \end{aligned}$$

All of \(T_{\mathsf {SPLIT}}\), \(T_{\mathsf {DEFLATE}}\), and \(T_{\mathsf {MM}}\) are of the form \(\mathrm {polylog}(n)\mathrm {poly}(n)\), with all coefficients nonnegative and exponents in the \(\mathrm {poly}(n)\) no smaller than 2. So, for any \(n_+ + n_- = n\) and \(n_{\pm } \le 4 n/5\), holding all other parameters fixed, \(T_{\mathsf {SPLIT}}(n_+,...) + T_{\mathsf {SPLIT}}(n_-,...) \le \left( (4/5)^2 + (1/5)^2\right) T_{\mathsf {SPLIT}}(n,...) = (17/25)T_{\mathsf {SPLIT}}(n,...)\) and the same holds for \(T_{\mathsf {DEFLATE}}\) and \(T_{\mathsf {MM}}\). Applying this recursively, with all parameters other than n set to their lower bounds from Lemma 5.13, we then have

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n)&\le \frac{1}{1 - 17/25}\left( T_{\mathsf {SPLIT}}\left( n,\epsilon /n,\mathsf {g},\frac{\delta ^4\epsilon ^8 \theta ^2}{(5n)^{26}}\right) \right. \\&\quad + \left. T_{\mathsf {DEFLATE}}\left( n,\beta /n,\epsilon /n,\frac{\delta ^4\epsilon ^8 \theta ^2}{(5n)^{26}}\right) + 4T_{\mathsf {MM}}(n) + O(n^2 ) \right) \\&= \frac{25}{8}\left( 12 N_{\mathsf {EIG}} \lg \frac{1}{\omega (\mathsf {g})} \left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 2 T_{\mathsf {QR}}(n) \right. \\&\quad + 5 T_{\mathsf {MM}}(n) + n^2 T_{\mathsf {N}} + O(n^2)\bigg ) \\&\le 60 N_{\mathsf {EIG}}\lg \frac{1}{\omega (\mathsf {g})}\left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 10 T_{\mathsf {QR}}(n) + 25 T_{\mathsf {MM}}(n), \end{aligned}$$

where

$$\begin{aligned} N_{\mathsf {EIG}} := \lg \frac{256 n}{\epsilon } + 3\lg \lg \frac{256 n}{\epsilon } + \lg \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59. \end{aligned}$$

In the above inequalities, we have substituted in the expressions for \(T_{\mathsf {SPLIT}}\) and \(T_{\mathsf {DEFLATE}}\) from Theorems 5.2 and 5.3, respectively; \(N_{\mathsf {EIG}}\) is defined by recomputing \(N_{\mathsf {SPLIT}}\) with the parameter lower bounds, and the \(\epsilon ^9\) is not an error. The final inequality uses our assumption \(T_\mathsf {N}= O(1)\). Thus using the fast and stable instantiations of \(\mathsf {MM}\), \(\mathsf {INV}\), and \(\mathsf {QR}\) from Theorem 2.10, we have

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n) = O\left( \log \frac{1}{\omega (\mathsf {g})}\left( \log \frac{n}{\epsilon } + \log \log \frac{1}{\theta \delta }\right) T_{\mathsf {MM}}(n,{\textbf {u }})\right) ; \end{aligned}$$
(44)

exact constants can be extracted by analyzing \(N_{\mathsf {EIG}}\) and unpacking Theorem 2.10.

Required Bits of Precision. We will need the following bound on the norms of all spectral projectors.

Lemma 5.15

(Sizes of Spectral Projectors) Throughout the algorithm, every approximate spectral projector \({\widetilde{P}}\) given to \(\mathsf {DEFLATE}\) satisfies \(\Vert {\widetilde{P}}\Vert \le 10n/\epsilon \).

Proof

Every such \({\widetilde{P}}\) is \(\beta \)-close to a true spectral projector P of a matrix whose \(\epsilon /n\)-pseudospectrum is shattered with respect to the initial \(8\times 8\) unit grid \(\mathsf {g}\). Since we can generate P by a contour integral around the boundary of a rectangular subgrid, we have

$$\begin{aligned} \Vert {\widetilde{P}}\Vert \le 2 + \Vert P\Vert \le 2 + \frac{32}{2\pi }\frac{n}{\epsilon } \le 10n/\epsilon , \end{aligned}$$

with the last inequality following from \(\epsilon < 1\). \(\square \)
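The contour-integral representation invoked in this proof is straightforward to realize numerically. The sketch below (ours) approximates \(P = \frac{1}{2\pi i}\oint _\Gamma (z-A)^{-1}\,dz\) over the boundary of a rectangle by a simple one-sided quadrature rule; the test matrix and point count are illustrative, and the result is accurate only up to quadrature error.

```python
import numpy as np

def contour_projector(A, ll, ur, pts_per_side=400):
    """Spectral projector onto the eigenvalues inside a rectangle, via
    P = (1 / (2 pi i)) * (integral of the resolvent over the boundary).

    ll, ur: lower-left and upper-right corners, as complex numbers.
    """
    n = A.shape[0]
    corners = [ll, complex(ur.real, ll.imag), ur, complex(ll.real, ur.imag), ll]
    P = np.zeros((n, n), dtype=complex)
    for a, b in zip(corners, corners[1:]):
        for t in np.arange(pts_per_side) / pts_per_side:
            z = a + t * (b - a)
            P += (b - a) / pts_per_side * np.linalg.inv(z * np.eye(n) - A)
    return P / (2j * np.pi)

rng = np.random.default_rng(6)
A = np.diag([0.5 + 0.5j, 0.3 + 0.6j, -1.0, 2.0]) + 0.02 * rng.standard_normal((4, 4))
P = contour_projector(A, 0.0 + 0.0j, 1.0 + 1.0j)
# Two eigenvalues lie in [0,1] x [0,1]: trace ~= 2 and P @ P ~= P,
# both up to quadrature error.
print(round(np.trace(P).real), np.linalg.norm(P @ P - P))
```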

Collecting the machine precision requirements \({\textbf {u }}\le {\textbf {u }}_{\mathsf {SPLIT}},{\textbf {u }}_{\mathsf {DEFLATE}}\) from Theorems 5.2 and 5.3, as well as those we used in the course of our proof so far, and substituting in the parameter lower bounds from Lemma 5.13, we need \({\textbf {u }}\) to satisfy

$$\begin{aligned} {\textbf {u }}&\le \min \left\{ \frac{\left( 1 - \frac{\epsilon }{256n}\right) ^{2^{N_{\mathsf {EIG}} + 1}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}, \right. \\&\qquad \left. \frac{\epsilon }{100 n^2}, \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{26}}\frac{1}{4\Vert {\widetilde{P}}\Vert \max \{\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n)\}}, \right. \\&\qquad \left. \frac{\delta \epsilon ^2}{100 n^3 \cdot 2\mu _{\mathsf {QR}}(n)}, \frac{\delta \epsilon ^2}{100 n^3 \max \{ 4\mu _{\mathsf {MM}}(n),n, 2\mu _{\mathsf {QR}}(n)\}} \right\} \end{aligned}$$

From Lemma 5.15, \(\Vert {\widetilde{P}}\Vert \le 10n/\epsilon \), so the conditions in the last two lines are all satisfied if we make the crude upper bound

$$\begin{aligned} {\textbf {u }}\le \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{30}}\frac{1}{\max \{\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n),n\}}, \end{aligned}$$
(45)

i.e. if \(\lg 1/{\textbf {u }}\ge O\left( \lg \frac{n}{\theta \delta \epsilon }\right) \). Unpacking the first requirement, using the definition \( N_{\mathsf {EIG}} := \lg \tfrac{256 n}{\epsilon } + 3\lg \lg \tfrac{256 n}{\epsilon } + \lg \lg \tfrac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59\) from Theorem 5.5, and recalling that \(\epsilon \le 1/2\), \(n \ge 1\), and \((1 - x)^{1/x} \ge 1/4\) for \(x \in (0,1/512)\), we have

$$\begin{aligned} \frac{\left( 1 - \frac{\epsilon }{256n}\right) ^{2^{N_{\mathsf {EIG}} + 1}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}&= \frac{\left( \left( 1 - \frac{\epsilon }{256n}\right) ^{\frac{256n}{\epsilon }}\right) ^{\lg ^3\frac{256 n}{\epsilon } \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{8.59}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n}N_{\mathsf {EIG}}} \\&\ge \frac{4^{-\lg ^3\frac{256 n}{\epsilon } \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{8.59}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}, \end{aligned}$$

so setting \({\textbf {u }}\) smaller than the final expression is sufficient to guarantee \(\mathsf {EIG}\) and all subroutines can execute as advertised. This gives

$$\begin{aligned} \lg 1/{\textbf {u }}&\ge \lg ^3 \frac{n}{\epsilon }\lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{9.59}(c_{\mathsf {INV}}\log n + 3) + \lg N_{\mathsf {EIG}} \\&= O\left( \log ^3\frac{n}{\epsilon }\log \frac{n}{\theta \delta \epsilon }\log n\right) . \end{aligned}$$

This dominates the precision requirement from (45), and completes the proof of Theorem 5.5.

Remark 5.16

A constant may be extracted directly from the expression above—leaving \(\epsilon ,\delta ,\theta \) fixed, a crude bound on it is \(2^{9.59} \cdot 26 \cdot 8 \cdot c_{\mathsf {INV}} \approx 160303 c_{\mathsf {INV}}\). This can certainly be optimized; the improvement with the highest impact would be a tighter analysis of \(\mathsf {SPLIT}\), with the aim of eliminating the additive 7.59 term in \(N_{\mathsf {SPLIT}}\).

6 Conclusion and Open Questions

In this paper, we reduced the approximate diagonalization problem to a polylogarithmic number of matrix multiplications, inversions, and QR factorizations on a floating point machine with precision depending only polylogarithmically on n and \(1/\delta \). The key phenomena enabling this were: (a) every matrix is \(\delta \)-close to a matrix with well-behaved pseudospectrum, and such a matrix can be found by a complex Gaussian perturbation and (b) the spectral bisection algorithm can be shown to converge rapidly to a forward approximate solution on such a well-behaved matrix, using a polylogarithmic in n and \(1/\delta \) amount of precision and number of iterations. The combination of these facts yields a \(\delta \)-backward approximate solution for the original problem.

Using fast matrix multiplication, we obtain algorithms with nearly optimal asymptotic computational complexity (as a function of n, compared to matrix multiplication), for general complex matrices with no assumptions. Using naïve matrix multiplication, we get easily implementable algorithms with \(O(n^3)\)-type complexity and much better constants, which are likely faster in practice. The constants in our bit complexity and precision estimates (see Theorem 5.5 and equations (41) and (42)), while not huge, are likely suboptimal. The reasonable practical performance of spectral bisection based algorithms is witnessed by the many empirical papers (see e.g. [5]) that have studied them. The more recent of these works further show that such algorithms are communication-avoiding and have good parallelizability properties.

Remark 6.1

(Hermitian Matrices) A curious feature of our algorithm is that even when the input matrix is Hermitian or real symmetric, it begins by adding a complex non-Hermitian perturbation to regularize the spectrum. If one is only interested in this special case, one can replace this first step by a Hermitian GUE or symmetric GOE perturbation and appeal to the result of [1] instead of Theorem 1.4, which also yields a polynomial lower bound on the minimum gap of the perturbed matrix. It is also possible to obtain a much stronger analysis of the Newton iteration in the Hermitian case, since the iterates are all Hermitian and \(\kappa _V=1\) for such matrices. By combining these observations, one can obtain a running time for Hermitian matrices which is significantly better (in logarithmic factors) than our main theorem. We do not pursue this further since our main goal was to address the more difficult non-Hermitian case.

We conclude by listing several directions for future research.

1. Devise a deterministic algorithm with similar guarantees. The main bottleneck to doing this is deterministically finding a regularizing perturbation, which seems quite mysterious. Another bottleneck is computing a rank-revealing QR factorization in near matrix multiplication time deterministically (all of the currently known deterministic algorithms require \(\Omega (n^3)\) time).

2. Determine the correct exponent for smoothed analysis of the eigenvalue gap of \(A+\gamma G\) where G is a complex Ginibre matrix. We currently obtain roughly \((\gamma /n)^{8/3}\) in Theorem 3.6. Is it possible to match the \(n^{-4/3}\) type dependence [64] which is known for a pure Ginibre matrix?

3. Reduce the dependence of the running time and precision to a smaller power of \(\log (1/\delta )\). The bottleneck in the current algorithm is the number of bits of precision required for stable convergence of the Newton iteration for computing the sign function. Other, “inverse-free” iterative schemes have been proposed for this, which conceivably require lower precision.

4. Study the convergence of “scaled Newton iteration” and other rational approximation methods (see [40, 48]) for computing the sign function on non-Hermitian matrices. Perhaps these have even faster convergence and better stability properties?

More broadly, we hope that the techniques introduced in this paper—pseudospectral shattering and pseudospectral analysis of matrix iterations using contour integrals—are useful in attacking other problems in numerical linear algebra.