1 Introduction

We study the algorithmic problem of approximately finding all of the eigenvalues and eigenvectors of a given arbitrary \(n\times n\) complex matrix. While this problem is quite well-understood in the special case of Hermitian matrices (see, e.g., [52]), the general non-Hermitian case has remained mysterious from a theoretical standpoint even after several decades of research. In particular, the currently best-known provable algorithms for this problem run in time \(O(n^{10}/\delta ^2)\) [2] or \(O(n^c\log (1/\delta ))\) [17] with \(c\ge 12\), where \(\delta >0\) is the desired accuracy, depending on the model of computation and notion of approximation considered (Footnote 1). To be sure, the non-Hermitian case is well-motivated: coupled systems of differential equations, linear dynamical systems in control theory, transfer operators in mathematical physics, and the nonbacktracking matrix in spectral graph theory are but a few situations where finding the eigenvalues and eigenvectors of a non-Hermitian matrix is important.

The key difficulties in dealing with non-normal matrices are the interrelated phenomena of non-orthogonal eigenvectors and spectral instability, the latter referring to extreme sensitivity of the eigenvalues and invariant subspaces to perturbations of the matrix. Non-orthogonality slows down convergence of standard algorithms such as the power method, and spectral instability can force the use of very high precision arithmetic, also leading to slower algorithms. Both phenomena together make it difficult to reduce the eigenproblem to a subproblem by “removing” an eigenvector or invariant subspace, since this can only be done approximately and one must control the spectral stability of the subproblem in order to be able to rigorously reason about it.

In this paper, we overcome these difficulties by identifying and leveraging a phenomenon we refer to as pseudospectral shattering: adding a small complex Gaussian perturbation to any matrix typically yields a matrix with well-conditioned eigenvectors and a large minimum gap between the eigenvalues, implying spectral stability. Previously, even the existence of such a regularizing perturbation with favorable parameters was not known [20]. This result builds on the recent solution of Davies’ conjecture [9] and is of independent interest in random matrix theory, where minimum eigenvalue gap bounds in the non-Hermitian case were previously only known for i.i.d. models [33, 55].

We complement the above by proving that a variant of the well-known spectral bisection algorithm in numerical linear algebra [11] is both fast and numerically stable when run on a pseudospectrally shattered matrix—we call an iterative algorithm numerically stable if it can be implemented using finite precision arithmetic with polylogarithmically many bits, corresponding to a dynamical system whose trajectory to the approximate solution is robust to adversarial noise (see, e.g. [57]). The key step in the bisection algorithm is computing the sign function of a matrix, a problem of independent interest in many areas such as control theory and approximation theory [44]. Our main algorithmic contribution is a rigorous analysis of the well-known Newton iteration method [53] for computing the sign function in finite arithmetic, showing that it converges quickly and numerically stably on matrices for which the sign function is well-conditioned, in particular on pseudospectrally shattered ones.

The end result is an algorithm which reduces the general diagonalization problem to a polylogarithmic (in the desired accuracy and dimension n) number of invocations of standard numerical linear algebra routines (multiplication, inversion, and QR factorization), each of which is reducible to matrix multiplication [22], yielding a nearly matrix multiplication runtime for the whole algorithm. This improves on the previously best-known running time of \(O(n^3+n^2\log (1/\delta ))\) arithmetic operations even in the Hermitian case ([21], see also [41, 52]), and yields the same improvement for the related problem of computing the singular value decomposition of a matrix.

We now proceed to give precise mathematical formulations of the eigenproblem and computational model, followed by statements of our results and a detailed discussion of related work.

1.1 Problem Statement

An eigenpair of a matrix \(A\in \mathbb {C}^{n\times n}\) is a tuple \((\lambda , v)\in \mathbb {C}\times \mathbb {C}^n\) such that

$$\begin{aligned} Av=\lambda v, \end{aligned}$$

and v is normalized to be a unit vector. The eigenproblem is the problem of finding a maximal set of linearly independent eigenpairs \((\lambda _i,v_i)\) of a given matrix A; note that an eigenvalue may appear more than once if it has geometric multiplicity greater than one. In the case when A is diagonalizable, the solution consists of exactly n eigenpairs, and if A has distinct eigenvalues then the solution is unique, up to the phases of the \(v_i\).

1.1.1 Accuracy and Conditioning

Due to the Abel–Ruffini theorem, it is impossible to have a finite-time algorithm which solves the eigenproblem exactly using arithmetic operations and radicals. Thus, all we can hope for is approximate eigenvalues and eigenvectors, up to a desired accuracy \(\delta >0\). There are two standard notions of approximation. We assume \(\Vert A\Vert \le 1\) for normalization, where throughout this work, \(\Vert \cdot \Vert \) denotes the spectral norm (the \(\ell ^2 \rightarrow \ell ^2\) operator norm).

Forward Approximation. Compute pairs \((\lambda _i',v_i')\) such that

$$\begin{aligned} |\lambda _i-\lambda _i'|\le \delta \quad \text {and}\quad \Vert v_i-v_i'\Vert \le \delta \end{aligned}$$

for the true eigenpairs \((\lambda _i,v_i)\), i.e., find a solution close to the exact solution. This makes sense in contexts where the exact solution is meaningful, e.g., the matrix is of theoretical/mathematical origin, and unstable (in the entries) quantities such as eigenvalue multiplicity can have a significant meaning.

Backward Approximation. Compute \((\lambda _i',v_i')\) which are the exact eigenpairs of a matrix \(A'\) satisfying

$$\begin{aligned} \Vert A'-A\Vert \le \delta , \end{aligned}$$

i.e., find the exact solution to a nearby problem. This is the appropriate and standard notion in scientific computing, where the matrix is of physical or empirical origin and is not assumed to be known exactly (and even if it were, roundoff error would destroy this exactness). Note that since diagonalizable matrices are dense in \(\mathbb {C}^{n\times n}\), one can hope to always find a complete set of eigenpairs for some nearby \(A'=VDV^{-1}\), yielding an approximate diagonalization of A:

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \delta . \end{aligned}$$
(1)
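To make (1) concrete: a Jordan block is not diagonalizable, but an arbitrarily small perturbation of it is. The following minimal sketch (ours, assuming numpy; it is an illustration, not the algorithm of this paper) exhibits such an approximate diagonalization and the attendant growth of \(\kappa (V)\).

```python
import numpy as np

# A 4x4 nilpotent Jordan block: not diagonalizable.
n, delta = 4, 1e-6
J = np.diag(np.ones(n - 1), k=1)

# Perturbing the bottom-left entry gives a matrix A' with n distinct
# eigenvalues (the n-th roots of delta), at distance delta from J.
A_prime = J.copy()
A_prime[-1, 0] = delta

evals, V = np.linalg.eig(A_prime)
D = np.diag(evals)

# An exact diagonalization of A', hence a delta-backward approximate
# diagonalization of J in the sense of (1).
print(np.linalg.norm(J - V @ D @ np.linalg.inv(V), 2))  # ~ delta
print(np.linalg.cond(V))  # kappa(V) blows up as delta -> 0
```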

Note that the eigenproblem in either of the above formulations is not easily reducible to the problem of computing eigenvalues, since they can only be computed approximately and it is not clear how to obtain approximate eigenvectors from approximate eigenvalues. We now introduce a condition number for the eigenproblem, which measures the sensitivity of the eigenpairs of a matrix to perturbations and allows us to relate its forward and backward approximate solutions.

Condition Numbers. For diagonalizable A, the eigenvector condition number of A, denoted \(\kappa _V(A)\), is defined as:

$$\begin{aligned} \kappa _V(A):=\inf _{V} \Vert V\Vert \Vert V^{-1}\Vert , \end{aligned}$$
(2)

where the infimum is over all invertible V such that \(A=VDV^{-1}\) for some diagonal D, and its minimum eigenvalue gap is defined as:

$$\begin{aligned} \mathrm {gap}(A):=\min _{i\ne j}|\lambda _i(A)-\lambda _j(A)|, \end{aligned}$$

where \(\lambda _i\) are the eigenvalues of A (with multiplicity).

We define the condition number of the eigenproblem to be (Footnote 2):

$$\begin{aligned} \kappa _{\mathrm {eig}}(A):=\frac{\kappa _V(A)}{\mathrm {gap}(A)}\in [0,\infty ]. \end{aligned}$$
(3)
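In floating point practice, \(\kappa _{\mathrm {eig}}\) can be estimated from a single computed diagonalization, as in the sketch below (ours, assuming numpy). Note that numpy returns one particular eigenvector matrix V with unit columns, so \(\kappa (V)\) only upper-bounds the infimum in (2), and the resulting quantity upper-bounds \(\kappa _{\mathrm {eig}}(A)\).

```python
import numpy as np
from itertools import combinations

def kappa_eig_upper_bound(A):
    """Upper bound on kappa_eig(A) = kappa_V(A) / gap(A) from one
    computed diagonalization: the gap is computed from the computed
    eigenvalues, while cond(V) upper-bounds the infimum kappa_V(A)."""
    evals, V = np.linalg.eig(A)
    kappa_V = np.linalg.cond(V)  # ||V|| ||V^{-1}|| for this choice of V
    gap = min(abs(a - b) for a, b in combinations(evals, 2))
    return kappa_V / gap

rng = np.random.default_rng(0)
n = 50
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
print(kappa_eig_upper_bound(G))  # modest for a Ginibre matrix
```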

It follows from the proposition below (whose proof appears in Sect. 2.2) that a \(\delta \)-backward approximate solution of the eigenproblem is a \(6n\kappa _{\mathrm {eig}}(A)\delta \)-forward approximate solution (Footnote 3).

Proposition 1.1

If \(\Vert A\Vert ,\Vert A'\Vert \le 1\), \(\Vert A-A'\Vert \le \delta \), and \(\{(v_i,\lambda _i)\}_{i\le n}\), \(\{(v_i',\lambda _i')\}_{i\le n}\) are eigenpairs of \(A,A'\) with distinct eigenvalues, and \(\delta < \frac{\mathrm {gap}(A)}{8 \kappa _V(A)}\), then

$$\begin{aligned} \Vert v_i'-v_i\Vert \le 2n\kappa _{\mathrm {eig}}(A) \delta \quad \text { and }\quad |\lambda _i'-\lambda _i|\le \kappa _V(A) \delta \le 2\kappa _{\mathrm {eig}}(A) \delta \quad \forall i=1,\ldots ,n, \end{aligned}$$
(4)

after possibly multiplying the \(v_i\) by phases.

Note that \(\kappa _{\mathrm {eig}}=\infty \) if and only if A has a double eigenvalue; in this case, a relation like (4) is not possible since different infinitesimal changes to A can produce macroscopically different eigenpairs.

In this paper we will present a backward approximation algorithm for the eigenproblem with running time scaling polynomially in \(\log (1/\delta )\), which by (4) yields a forward approximation algorithm with running time scaling polynomially in \(\log (\kappa _{\mathrm {eig}}/\delta )\).

Remark 1.2

(Multiple Eigenvalues) A backward approximation algorithm for the eigenproblem can be used to accurately find bases for the eigenspaces of matrices with multiple eigenvalues, but quantifying the forward error requires introducing condition numbers for invariant subspaces rather than eigenpairs. A standard treatment of this can be found in any numerical linear algebra textbook, e.g. [26], and we do not discuss it further in this paper for simplicity of exposition.

1.1.2 Models of Computation

These questions may be studied in various computational models: exact real arithmetic (i.e., infinite precision), variable precision rational arithmetic (rationals are stored exactly as numerators and denominators), and finite precision arithmetic (real numbers are rounded to a fixed number of bits which may depend on the input size and accuracy). Only the last two models yield actual Boolean complexity bounds, but introduce a second source of error stemming from the fact that computers cannot exactly represent real numbers.

We study the third model in this paper, axiomatized as follows.

Finite Precision Arithmetic. We use the standard floating point axioms from [39]. Numbers are stored and manipulated approximately up to some machine precision \({\textbf {u }}:={\textbf {u }}(\delta ,n)>0\), which for us will depend on the instance size n and desired accuracy \(\delta \). This means every number \(x\in \mathbb {C}\) is stored as \(\mathsf {fl}(x)=(1+\Delta )x\) for some adversarially chosen \(\Delta \in \mathbb {C}\) satisfying \(|\Delta |\le {\textbf {u }}\), and each arithmetic operation \(\circ \in \{+,-,\times ,\div \}\) is guaranteed to yield an output satisfying

$$\begin{aligned} \mathsf {fl}(x\circ y) = (x\circ y)(1+\Delta )\quad |\Delta |\le {\textbf {u }}. \end{aligned}$$

It is also standard and convenient to assume that we can evaluate \(\sqrt{x}\) for any \(x\in \mathbb {R}\), where again \(\mathsf {fl}(\sqrt{x}) = \sqrt{x} (1 + \Delta )\) for \(|\Delta | \le {\textbf {u }}\).

Thus, the outcomes of all operations are adversarially noisy due to roundoff. The bit lengths of numbers stored in this form remain fixed at \(\lg (1/{\textbf {u }})\), where \(\lg \) denotes the logarithm base 2. The bit complexity of an algorithm is therefore the number of arithmetic operations times \(O^*(\log (1/{\textbf {u }}))\), the running time of standard floating point arithmetic, where the \(*\) suppresses \(\log \log (1/{\textbf {u }})\) factors. We will state all running times in terms of arithmetic operations accompanied by the required number of bits of precision, which thereby immediately imply bit complexity bounds.
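For instance, in IEEE double precision one has \({\textbf {u }}=2^{-53}\), and the familiar discrepancies of floating point arithmetic are exactly the adversarial \(\Delta \)'s of this model, as the following toy check (ours, assuming numpy) illustrates.

```python
import numpy as np

u = 2.0 ** -53  # unit roundoff of IEEE double precision

# 0.1, 0.2, 0.3 are each stored as fl(x) = x(1 + Delta), and the sum
# incurs one more Delta, so the relative error is a small multiple of u.
s = 0.1 + 0.2
print(s == 0.3)                           # False
print(abs(s - 0.3) / 0.3 <= 4 * u)        # True: a handful of rounding steps
print(np.finfo(np.float64).eps / 2 == u)  # numpy's eps is 2u
```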

Remark 1.3

(Overflow, Underflow, and Additive Error) Using p bits for the exponent in the floating-point representation allows one to represent numbers with magnitude in the range \([2^{-2^p},2^{2^p}]\). It can be easily checked that all of the nonzero numbers, norms, and condition numbers appearing during the execution of our algorithms lie in the range \([2^{-\lg ^c(n/\delta )},2^{\lg ^c(n/\delta )}]\) for some small c, so overflow and underflow do not occur. In fact, we could have analyzed our algorithm in a computational model where every number is simply rounded to the nearest rational with denominator \(2^{\lg ^c(n/\delta )}\)—corresponding to additive arithmetic errors. We have chosen to use the multiplicative error floating point model since it is the standard in numerical analysis, but our algorithms do not exploit any subtleties arising from the difference between the two models.

The advantages of the floating point model are that it is realistic and potentially yields very fast algorithms by using a small number of bits of precision (polylogarithmic in n and \(1/\delta \)), in contrast to rational arithmetic, where even a simple operation such as inverting an \(n\times n\) integer matrix requires n extra bits of precision (see, e.g., Chapter 1 of [35]). An iterative algorithm that can be implemented in finite precision (typically, with a number of bits polylogarithmic in the input size and desired accuracy) is called numerically stable.

The disadvantage of the model is that it is only possible to compute forward approximations of quantities which are well-conditioned in the input—in particular, discontinuous quantities such as eigenvalue multiplicity cannot be computed in the floating point model, since it is not even assumed that the input is stored exactly.

1.2 Results and Techniques

In addition to \(\kappa _{\mathrm {eig}}\), we will need some more refined quantities to measure the stability of the eigenvalues and eigenvectors of a matrix to perturbations, and to state our results. The most important of these is the \(\epsilon \)-pseudospectrum, defined for any \(\epsilon >0\) and \(M\in \mathbb {C}^{n\times n}\) as:

$$\begin{aligned} \Lambda _\epsilon (M)&:= \left\{ \lambda \in \mathbb {C}: \lambda \in \Lambda (M + E) \text { for some }\Vert E\Vert < \epsilon \right\} \end{aligned}$$
(5)
$$\begin{aligned}&= \left\{ \lambda \in \mathbb {C}: \left\| (\lambda - M)^{-1}\right\| > 1/\epsilon \right\} \end{aligned}$$
(6)

where \(\Lambda (\cdot )\) denotes the spectrum of a matrix. The equivalence of (5) and (6) is simple and can be found in the excellent book [62].
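Characterization (6) also suggests a brute-force way of computing pseudospectra: \(z\in \Lambda _\epsilon (M)\) precisely when the smallest singular value of \(z-M\) is below \(\epsilon \). A sketch of this (ours, assuming numpy; see [62] for serious algorithms):

```python
import numpy as np

def in_pseudospectrum(M, grid, eps):
    """Mark grid points z with sigma_min(z - M) < eps, i.e., z in
    Lambda_eps(M) by (6). Brute force: one SVD per grid point."""
    I = np.eye(M.shape[0])
    smin = np.array([np.linalg.svd(z * I - M, compute_uv=False)[-1]
                     for z in grid.ravel()]).reshape(grid.shape)
    return smin < eps

# A 20x20 Jordan block has spectrum {0}, but Lambda_eps fills a disk of
# radius roughly eps^(1/n), which is large even for tiny eps.
n = 20
J = np.diag(np.ones(n - 1), k=1)
xs = np.linspace(-1.5, 1.5, 61)
grid = xs[None, :] + 1j * xs[:, None]
print(in_pseudospectrum(J, grid, eps=1e-8).mean())  # a visible fraction
```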

Eigenvalue Gaps, \(\kappa _V\), and Pseudospectral Shattering. The key probabilistic result of the paper is that a random complex Gaussian perturbation of any matrix yields a nearby matrix with large minimum eigenvalue gap and small \(\kappa _V\).

Theorem 1.4

(Smoothed Analysis of \(\mathrm {gap}\) and \(\kappa _V\)) Suppose \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\), and \(\gamma \in (0,1/2)\). Let \(G_n\) be an \(n\times n\) matrix with i.i.d. complex Gaussian \(N(0,1_\mathbb {C}/n)\) entries, and let \(X:=A+\gamma G_n\). Then

$$\begin{aligned} \kappa _V(X)\le \frac{n^2}{\gamma }, \quad \mathrm {gap}(X)\ge \frac{\gamma ^4}{n^5}, \quad \text {and}\quad \Vert G_n\Vert \le 4, \end{aligned}$$

with probability at least \(1-12/n\).

The proof of Theorem 1.4 appears in Sect. 3.1. The key idea is to first control \(\kappa _V(X)\) using [9] and then observe that for a matrix with small \(\kappa _V\), two eigenvalues of X near a complex number z imply a small second-least singular value of \(z-X\), which we are able to control.
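Theorem 1.4 is easy to probe numerically. The sketch below (ours, assuming numpy; parameters chosen arbitrarily) perturbs a worst-case input, a nilpotent Jordan block, for which \(\kappa _V=\infty \) and \(\mathrm {gap}=0\), and compares the observed quantities to the bounds of the theorem; on typical draws both hold by a wide margin.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, gamma = 100, 0.01

A = np.diag(np.ones(n - 1), k=1)  # Jordan block: kappa_V infinite, gap 0
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G

evals, V = np.linalg.eig(X)
gap = min(abs(a - b) for a, b in combinations(evals, 2))
print(np.linalg.cond(V), "vs bound", n**2 / gamma)  # cond(V) >= kappa_V(X)
print(gap, "vs bound", gamma**4 / n**5)
```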

In Sect. 3.2 we develop the notion of pseudospectral shattering, which is implied by Theorem 1.4 and says roughly that the pseudospectrum consists of n components that lie in separate squares of an appropriately coarse grid in the complex plane. This is useful in the analysis of the spectral bisection algorithm in Sect. 5.

Matrix Sign Function. The sign function of a number \(z\in \mathbb {C}\) with \({\text {Re}}(z)\ne 0\) is defined as \(+1\) if \({\text {Re}}(z)>0\) and \(-1\) if \({\text {Re}}(z)<0\). The matrix sign function of a matrix A with Jordan normal form

$$\begin{aligned} A = V\begin{bmatrix} N & \\ & P\end{bmatrix}V^{-1}, \end{aligned}$$

where N (resp. P) has eigenvalues with strictly negative (resp. positive) real part, is defined as

$$\begin{aligned} \mathrm {sgn}(A) = V\begin{bmatrix} -I_N & \\ & I_P\end{bmatrix} V^{-1}, \end{aligned}$$

where \(I_P\) denotes the identity of the same size as P. The sign function is undefined for matrices with eigenvalues on the imaginary axis. Quantifying this discontinuity, Bai and Demmel [4] defined the following condition number for the sign function:

$$\begin{aligned} \kappa _{\mathrm {sign}}(M) := \inf \left\{ 1/\epsilon ^2 : \Lambda _\epsilon (M) \text { does not intersect the imaginary axis} \right\} , \end{aligned}$$
(7)

and gave perturbation bounds for \(\mathrm {sgn}(M)\) depending on \(\kappa _{\mathrm {sign}}\).
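By (7), estimating \(\kappa _{\mathrm {sign}}(M)\) amounts to finding the largest \(\epsilon \) for which \(\Lambda _\epsilon (M)\) avoids the imaginary axis, i.e., \(\inf _{y\in \mathbb {R}}\sigma _n(iy-M)\). A sketch (ours, assuming numpy; the grid search can overestimate this infimum and hence underestimate \(\kappa _{\mathrm {sign}}\)):

```python
import numpy as np

def kappa_sign_lower_bound(M, ys):
    """Estimate kappa_sign(M) = 1/eps^2, where eps is the infimum of
    sigma_min(iy - M) over real y; minimizing over a finite grid of y
    only yields a lower bound on kappa_sign."""
    I = np.eye(M.shape[0])
    eps = min(np.linalg.svd(1j * y * I - M, compute_uv=False)[-1] for y in ys)
    return 1.0 / eps**2

M = np.array([[1.0, 100.0], [0.0, -1.0]])  # nonnormal, eigenvalues +1, -1
print(kappa_sign_lower_bound(M, np.linspace(-5.0, 5.0, 2001)))  # ~ 1e4
```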

Roberts [53] showed that the simple iteration

$$\begin{aligned} A_{k+1}=\frac{A_k+A_k^{-1}}{2} \end{aligned}$$
(8)

converges globally and quadratically to \(\mathrm {sgn}(A)\) in exact arithmetic, but his proof relies on the fact that all iterates of the algorithm are simultaneously diagonalizable, a property which is destroyed in finite arithmetic since inversions can only be done approximately (Footnote 4). In Sect. 4 we show that this iteration is indeed convergent when implemented in finite arithmetic for matrices with small \(\kappa _{\mathrm {sign}}\), given a numerically stable matrix inversion algorithm. This leads to the following result:

Theorem 1.5

(Sign Function Algorithm) There is a deterministic algorithm \(\mathsf {SGN}\) which on input an \(n \times n\) matrix A with \(\Vert A\Vert \le 1\), a number K with \(K \ge \kappa _{\mathrm {sign}}(A)\), and a desired accuracy \(\beta \in (0, 1/12)\), outputs an approximation \(\mathsf {SGN}(A)\) with

$$\begin{aligned} \Vert \mathsf {SGN}(A)-\mathrm {sgn}(A)\Vert \le \beta , \end{aligned}$$

in

$$\begin{aligned} O( (\log K + \log \log (1/\beta )) T_{\mathsf {INV}}(n) ) \end{aligned}$$
(9)

arithmetic operations on a floating point machine with

$$\begin{aligned} \lg (1/{\textbf {u }}) = O(\log n \log ^3 K (\log K + \log (1/\beta ))) \end{aligned}$$

bits of precision, where \(T_\mathsf {INV}(n)\) denotes the number of arithmetic operations used by a numerically stable matrix inversion algorithm (satisfying Definition 2.7).

The main new idea in the proof of Theorem 1.5 is to control the evolution of the pseudospectra \(\Lambda _{\epsilon _k}(A_k)\) of the iterates with appropriately decreasing (in k) parameters \(\epsilon _k\), using a sequence of carefully chosen shrinking contour integrals in the complex plane. The pseudospectrum provides a richer induction hypothesis than scalar quantities such as condition numbers, and allows one to control all quantities of interest using the holomorphic functional calculus. This technique is introduced in Sects. 4.1 and 4.2, and carried out in finite arithmetic in Sect. 4.3, yielding Theorem 1.5.
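For orientation, here is what iteration (8) looks like in ordinary double precision (a bare sketch, ours, assuming numpy; it has none of the scaling or precision safeguards that the analysis behind Theorem 1.5 concerns, and it will behave badly when \(\kappa _{\mathrm {sign}}\) is large).

```python
import numpy as np

def newton_sign(A, iters=50, tol=1e-12):
    """Roberts' Newton iteration (8): A_{k+1} = (A_k + A_k^{-1}) / 2.
    Converges quadratically to sgn(A) when no eigenvalue of A is on or
    near the imaginary axis."""
    X = A.astype(complex)
    for _ in range(iters):
        X_next = (X + np.linalg.inv(X)) / 2
        if np.linalg.norm(X_next - X, 2) <= tol * np.linalg.norm(X, 2):
            return X_next
        X = X_next
    return X

rng = np.random.default_rng(2)
A = np.diag([1.0, 2.0, -1.0, -3.0]) + 0.1 * rng.standard_normal((4, 4))
S = newton_sign(A)
print(np.linalg.norm(S @ S - np.eye(4), 2))  # sgn(A)^2 = I
print(np.linalg.norm(S @ A - A @ S, 2))      # sgn(A) commutes with A
```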

Diagonalization by Spectral Bisection. Given an algorithm for computing the sign function, there is a natural and well-known approach to the eigenproblem pioneered in [11]. The idea is that the matrices \((I\pm \mathrm {sgn}(A))/2\) are spectral projectors onto the invariant subspaces corresponding to the eigenvalues of A in the left and right open half planes, so if some shifted matrix \(z + A\) or \(z + iA\) has roughly half its eigenvalues in each half plane, the problem can be reduced to smaller subproblems appropriate for recursion.

The two difficulties in carrying out the above approach are: (a) efficiently computing the sign function, and (b) finding a balanced splitting along an axis that is well-separated from the spectrum. These are nontrivial even in exact arithmetic, since the iteration (8) converges slowly if (b) is not satisfied, even without roundoff error. We use Theorem 1.4 to ensure that a good splitting always exists after a small Gaussian perturbation of order \(\delta \), and Theorem 1.5 to compute splittings efficiently in finite precision. Combining this with well-understood techniques such as rank-revealing QR factorization, we obtain the following theorem, whose proof appears in Sect. 5.1.
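In code, one step of this reduction is quite short. The sketch below (ours, assuming numpy and scipy, with scipy's column-pivoted QR standing in for the rank-revealing QR factorization mentioned above; shift selection and all finite-precision issues are omitted) splits a small test matrix into its two half-plane subproblems.

```python
import numpy as np
from scipy.linalg import qr  # column-pivoted QR, a rank-revealing proxy

def matrix_sign(A, iters=60):
    # Roberts' Newton iteration (8), as in the sketch above.
    X = A.astype(complex)
    for _ in range(iters):
        X = (X + np.linalg.inv(X)) / 2
    return X

def split_spectrum(A):
    """One bisection step. P = (I + sgn(A))/2 projects onto the invariant
    subspace for eigenvalues in the right half plane; if the first r
    columns of Q form an orthonormal basis of its range, then Q* A Q is
    block upper triangular and its diagonal blocks are the subproblems."""
    n = A.shape[0]
    P = (np.eye(n) + matrix_sign(A)) / 2
    r = int(round(np.trace(P).real))  # number of eigenvalues in the RHP
    Q, _, _ = qr(P, pivoting=True)    # first r columns of Q span range(P)
    B = Q.conj().T @ A @ Q
    return B[:r, :r], B[r:, r:]

A = np.diag([1.0, 2.0, -1.0, -3.0]) \
    + 0.05 * np.random.default_rng(3).standard_normal((4, 4))
right, left = split_spectrum(A)
print(np.linalg.eigvals(right))  # ~ {1, 2}
print(np.linalg.eigvals(left))   # ~ {-1, -3}
```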

Theorem 1.6

(Backward Approximation Algorithm) There is a randomized algorithm \(\mathsf {EIG}\) which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and a desired accuracy parameter \(\delta >0\) outputs a diagonal D and invertible V such that

$$\begin{aligned} \quad \Vert A-VDV^{-1}\Vert \le \delta \quad \mathrm {and}\quad \kappa (V) \le 32n^{2.5}/\delta \end{aligned}$$

in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O(\log ^4(n/\delta )\log n) \end{aligned}$$

bits of precision, with probability at least \(1-14/n\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Since there is a correspondence in terms of the condition number between backward and forward approximations, and as is customary in numerical analysis, our discussion revolves around backward approximation guarantees. For the convenience of the reader, we write down below the explicit guarantees that one gets by using (4) and invoking \(\mathsf {EIG}\) with accuracy \(\frac{\delta }{6n \kappa _{\mathrm {eig}}}\).

Corollary 1.7

(Forward Approximation Algorithm) There is a randomized algorithm which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\), a desired accuracy parameter \(\delta >0\), and an estimate \(K\ge \kappa _{\mathrm {eig}}(A)\) outputs a \(\delta \)-forward approximate solution to the eigenproblem for A in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n K}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O(\log ^4(nK/\delta )\log n) \end{aligned}$$

bits of precision, with probability at least \(1-1/n-12/n^2\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Remark 1.8

(Accuracy vs. Precision) The gold standard of “backward stability” in numerical analysis postulates that

$$\begin{aligned} \log (1/{\textbf {u }}) = \log (1/\delta ) + \log (n), \end{aligned}$$

i.e., the number of bits of precision is linear in the number of bits of accuracy. The relaxed notion of “logarithmic stability” introduced in [23] requires

$$\begin{aligned} \log (1/{\textbf {u }}) = \log (1/\delta )+O(\log ^c(n)\log (\kappa )) \end{aligned}$$

for some constant c, where \(\kappa \) is an appropriate condition number. In comparison, Theorem 1.6 obtains the weaker relationship

$$\begin{aligned} \log (1/{\textbf {u }}) = O(\log ^4(1/\delta )\log (n) + \log ^5(n)), \end{aligned}$$

which is still polylogarithmic in n in the regime \(\delta =1/\mathrm {poly}(n)\).

1.3 Related Work

Minimum Eigenvalue Gap. The minimum eigenvalue gap of random matrices has been studied in the case of Hermitian and unitary matrices, beginning with the work of Vinson [64], who proved an \(\Omega (n^{-4/3})\) lower bound on this gap in the case of the Gaussian Unitary Ensemble (GUE) and the Circular Unitary Ensemble (CUE). Bourgade and Ben Arous [3] derived exact limiting formulas for the distributions of all the gaps for the same ensembles. Nguyen, Tao, and Vu [50] obtained non-asymptotic inverse polynomial bounds for a large class of non-integrable Hermitian models with i.i.d. entries (including Bernoulli matrices).

In a different direction, Aizenman et al. proved an inverse-polynomial bound [1] in the case of an arbitrary Hermitian matrix plus a GUE matrix or a Gaussian orthogonal ensemble (GOE) matrix, which may be viewed as a smoothed analysis of the minimum gap. Theorem 3.6 may be viewed as a non-Hermitian analogue of the last result.

In the non-Hermitian case, Ge [33] obtained an inverse polynomial bound for i.i.d. matrices with real entries satisfying some mild moment conditions, and [55] (Footnote 5) proved an inverse polynomial lower bound for the complex Ginibre ensemble. Theorem 3.6 may be seen as a generalization of these results to non-centered complex Gaussian matrices.

Smoothed Analysis and Free Probability. The study of numerical algorithms on Gaussian random matrices (i.e., the case \(A=0\) of smoothed analysis) dates back to [25, 29, 56, 65]. The powerful idea of improving the conditioning of a numerical computation by adding a small amount of Gaussian noise was introduced by Spielman and Teng in [59], in the context of the simplex algorithm. Sankar, Spielman, and Teng [54] showed that adding real Gaussian noise to any matrix yields a matrix with polynomially bounded condition number; [9] can be seen as an extension of this result to the condition number of the eigenvector matrix, where the proof crucially requires that the Gaussian perturbation is complex rather than real. The main difference between our results and most of the results on smoothed analysis (including [2]) is that our running time depends logarithmically rather than polynomially on the size of the perturbation.

The broad idea of regularizing the spectral instability of a nonnormal matrix by adding a random matrix can be traced back to the work of Śniady [58] and Haagerup and Larsen [37] in the context of Free Probability theory.

Matrix Sign Function. The matrix sign function was introduced by Zolotarev in 1877. It became a popular topic in numerical analysis following the work of Beavers and Denman [10, 11, 27] and Roberts [53], who used it first to solve the algebraic Riccati and Lyapunov equations and then as an approach to the eigenproblem; see [44] for a broad survey of its early history. The numerical stability of Roberts’ Newton iteration was investigated by Byers [14], who identified some cases where it is and is not stable. Malyshev [46], Byers et al. [15], Bai et al. [5], and Bai and Demmel [4] studied the condition number of the matrix sign function, and showed that if the Newton iteration converges then it can be used to obtain a high-quality invariant subspace (Footnote 6), but did not prove convergence in finite arithmetic and left this as an open question (Footnote 7). The key issue in analyzing the convergence of the iteration is to bound the condition numbers of the intermediate matrices that appear, as N. Higham remarks in his 2008 textbook:

Of course, to obtain a complete picture, we also need to understand the effect of rounding errors on the iteration prior to convergence. This effect is surprisingly difficult to analyze. \(\ldots \) Since errors will in general occur on each iteration, the overall error will be a complicated function of \(\kappa _{sign}(X_k)\) and \(E_k\) for all k. \(\ldots \) We are not aware of any published rounding error analysis for the computation of sign(A) via the Newton iteration.—[40, Section 5.7]

This is precisely the problem solved by Theorem 1.5, which is as far as we know the first provable algorithm for computing the sign function of an arbitrary matrix which does not require computing the Jordan form.

In the special case of Hermitian matrices, Higham [38] established efficient reductions between the sign function and the polar decomposition. Byers and Xu [16] proved backward stability of a certain scaled version of the Newton iteration for Hermitian matrices, in the context of computing the polar decomposition. Higham and Nakatsukasa [49] (see also the improvement [48]) proved backward stability of a different iterative scheme for computing the polar decomposition, and used it to give backward stable spectral bisection algorithms for the Hermitian eigenproblem with \(O(n^3)\)-type complexity.

Non-Hermitian Eigenproblem. Floating Point Arithmetic. The eigenproblem has been thoroughly studied in the numerical analysis community, in the floating point model of computation. While there are provably fast and accurate algorithms in the Hermitian case (see the next subsection) and a large body of work for various structured matrices (see, e.g., [13]), the general case is not nearly as well-understood. As recently as 1997, J. Demmel remarked in his well-known textbook [26]: “\(\ldots \) the problem of devising an algorithm [for the non-Hermitian eigenproblem] that is numerically stable and globally (and quickly!) convergent remains open.”

Demmel’s question remained entirely open until 2015, when it was answered in the following sense by Armentano, Beltrán, Bürgisser, Cucker, and Shub in the remarkable paper [2]. They exhibited an algorithm (see their Theorem 2.28) which given any \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and \(\sigma >0\) produces in \(O(n^{9}/\sigma ^2)\) expected arithmetic operations the diagonalization of the nearby random perturbation \(A+\sigma G\) where G is a matrix with standard complex Gaussian entries. By setting \(\sigma \) sufficiently small, this may be viewed as a backward approximation algorithm for diagonalization, in that it solves a nearby problem essentially exactly (Footnote 8)—in particular, by setting \(\sigma =\delta /\sqrt{n}\) and noting that \(\Vert G\Vert =O(\sqrt{n})\) with very high probability, their result implies a running time of \(O(n^{10}/\delta ^2)\) in our setting. Their algorithm is based on homotopy continuation methods, which they argue informally are numerically stable and can be implemented in finite precision arithmetic. Our algorithm is similar on a high level in that it adds a Gaussian perturbation to the input and then obtains a high accuracy forward approximate solution to the perturbed problem. The difference is that their overall running time depends polynomially rather than logarithmically on the accuracy \(\delta \) desired with respect to the original unperturbed problem (Table 1).

Table 1 Results for finite-precision floating-point arithmetic

Other Models of Computation. If we relax the requirements further and ask for any provable algorithm in any model of Boolean computation, there is only one more positive result with a polynomial bound on the number of bit operations: Jin-Yi Cai showed in 1994 [17] that given an \(n\times n\) rational matrix A with entries of bit length a, one can find a \(\delta \)-forward approximation to its Jordan Normal Form \(A=VJV^{-1}\) in time \(\mathrm {poly}(n,a,\log (1/\delta ))\), where the degree of the polynomial is at least 12. This algorithm works in the rational arithmetic model of computation, so it does not quite answer Demmel’s question since it is not a numerically stable algorithm. However, it enjoys the significant advantage of being able to compute forward approximations to discontinuous quantities such as the Jordan structure (Table 2).

Table 2 Results for other models of arithmetic

As far as we are aware, there are no other published provably polynomial-time algorithms for the general eigenproblem. The two standard references for diagonalization appearing most often in theoretical computer science papers do not meet this criterion. In particular, the widely cited work by Pan and Chen [51] proves that one can compute the eigenvalues of A in \(O(n^\omega + n\log \log (1/\delta ))\) (suppressing logarithmic factors) arithmetic operations by finding the roots of its characteristic polynomial, which becomes a bound of \(O(n^{\omega +1}a+n^2\log (1/\delta )\log \log (1/\delta ))\) bit operations if the characteristic polynomial is computed exactly in rational arithmetic and the matrix has entries of bit length a. However, that paper does not give any bound for the amount of time taken to find approximate eigenvectors from approximate eigenvalues, and states this as an open problem (Footnote 9).

Finally, the important work of Demmel et al. [22] (see also the followup [6]), which we rely on heavily, does not claim to provably solve the eigenproblem either—it bounds the running time of one iteration of a specific algorithm, and shows that such an iteration can be implemented numerically stably, without proving any bound on the number of iterations required in general.

Hermitian Eigenproblem. For comparison, the eigenproblem for Hermitian matrices is much better understood. We cannot give a complete bibliography of this huge area, but mention one relevant landmark result: the work of Wilkinson [66], who exhibited a globally convergent diagonalization algorithm, and the work of Dekker and Traub [21], who quantified the rate of convergence of Wilkinson’s algorithm and from which it follows that the Hermitian eigenproblem can be solved with backward error \(\delta \) in \(O(n^3+n^2\log (1/\delta ))\) arithmetic operations in exact arithmetic (Footnote 10). We refer the reader to [52, §8.10] for the simplest and most insightful proof of this result, due to Hoffman and Parlett [41].

There has also recently been renewed interest in this problem in the theoretical computer science community, with the goal of bringing the runtime close to \(O(n^\omega )\): Louis and Vempala [45] show how to find a \(\delta \)-approximation of just the largest eigenvalue in \(O(n^\omega \log ^4(n)\log ^2(1/\delta ))\) bit operations, and Ben-Or and Eldar [12] give an \(O(n^{\omega +1}\mathrm {polylog}(n))\)-bit-operation algorithm for finding a \(1/\mathrm {poly}(n)\)-approximate diagonalization of an \(n\times n\) Hermitian matrix normalized to have \(\Vert A\Vert \le 1\).

Remark 1.9

(Davies’ Conjecture) The beautiful paper [20] introduced the idea of approximating a matrix function f(A) for nonnormal A by \(f(A+E)\) for some well-chosen E regularizing the eigenvectors of A. This directly inspired our approach to solving the eigenproblem via regularization.

The existence of an approximate diagonalization (1) for every A with a well-conditioned similarity V (i.e., \(\kappa (V)\) depending polynomially on \(1/\delta \) and n) was precisely the content of Davies’ conjecture [20], which was recently solved by some of the authors and Mukherjee in [9]. The existence of such a V is a prerequisite for proving that one can always efficiently find an approximate diagonalization in finite arithmetic, since if \(\Vert V\Vert \Vert V^{-1}\Vert \) is very large it may require many bits of precision to represent. Thus, Theorem 1.6 can be viewed as an efficient algorithmic answer to Davies’ question.

Remark 1.10

(Subsequent work in Random Matrix Theory) Since the first version of the present paper was made public there have been some advances in random matrix theory [8, 43] that prove analogues of Theorem 1.4 in the case where \(G_n\) is replaced by a perturbation with random real independent entries. These results formally articulate that, in the context of this paper, there is nothing special about complex Ginibre matrices, and that the same regularization effect can be achieved using a broader class of perturbations. Bounding the eigenvector condition number and the eigenvalue gap when the random perturbation has real entries poses interesting technical challenges that were tackled in different ways in the aforementioned papers. We also refer the reader to [18] where optimal results were obtained in the case where \(A=0\) and \(G_n\) has real Gaussian entries.

Remark 1.11

(Alternate Proofs using [2]) In October 2021 (about two years after the first appearance of this paper), we noticed that a version of Theorem 1.4 (with a worse \(\kappa _V\) bound but a better eigenvalue gap bound) as well as the main theorem of [9] (with a slightly worse dependence on n) can be easily derived from some auxiliary results shown in [2] (specifically Proposition 2.7 and Theorem 2.14 of that paper), which we were not previously aware of. We present these short alternate proofs in “Appendix D”. We remark that our original proofs are essentially different from those appearing in [2]—in particular, they rely on studying the area of pseudospectra, whereas the proof of Theorem 2.14 of [2] relies on geometric concepts and the coarea formula for Gaussian integrals of certain determinantal quantities on Riemannian manifolds. The proofs based on pseudospectra are arguably more flexible; as mentioned in Remark 1.10, they have been recently generalized to ensembles besides the complex Ginibre ensemble, which seems difficult to do for the more algebraic proofs of [2].

Reader Guide. This paper contains a lot of parameters and constants. On first reading, it may be good to largely ignore the constants not appearing in exponents and to keep in mind the typical setting \(\delta =1/\mathrm {poly}(n)\) for the accuracy, in which case the important auxiliary parameters \(\omega , 1-\alpha , \epsilon , \beta , \eta \) are all \(1/\mathrm {poly}(n)\), and the machine precision is \(\log (1/{\textbf {u }})=\mathrm {polylog}(n)\).

2 Preliminaries

Let \(M \in \mathbb {C}^{n\times n}\) be a complex matrix, not necessarily normal. We will write matrices and vectors with uppercase and lowercase letters, respectively. Let us denote by \(\Lambda (M)\) the spectrum of M and by \(\lambda _i(M)\) its individual eigenvalues. In the same way we denote the singular values of M by \(\sigma _i(M)\) and we adopt the convention \(\sigma _1(M) \ge \sigma _2(M) \ge \cdots \ge \sigma _n(M)\). When M is clear from the context we will simplify notation and just write \(\Lambda , \lambda _i\) or \(\sigma _i\), respectively.

Recall that the operator norm of M is

$$\begin{aligned} \Vert M\Vert = \sigma _1(M) = \sup _{\Vert x \Vert = 1}\Vert Mx\Vert . \end{aligned}$$

As usual, we will say that M is diagonalizable if it can be written as \(M = VDV^{-1}\) for some diagonal matrix D whose diagonal entries are the eigenvalues of M. In this case, we have the spectral expansion

$$\begin{aligned} M = \sum _{i=1}^n \lambda _i v_i w_i^*, \end{aligned}$$
(10)

where the right and left eigenvectors \(v_i\) and \(w_i^*\) are the columns and rows of V and \(V^{-1}\), respectively, normalized so that \(w^*_i v_i = 1\).
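This normalization is what one gets for free from a computed diagonalization: if the columns of V are the \(v_i\), then the rows of \(V^{-1}\) are exactly the \(w_i^*\). A quick numerical check of (10) (ours, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

evals, V = np.linalg.eig(M)  # columns of V: right eigenvectors v_i
W = np.linalg.inv(V)         # rows of W: left eigenvectors w_i^*

# w_i^* v_j = delta_ij in this pairing, and (10) reconstructs M.
expansion = sum(evals[i] * np.outer(V[:, i], W[i, :]) for i in range(n))
print(np.linalg.norm(M - expansion, 2))      # ~ machine precision
print(np.linalg.norm(W @ V - np.eye(n), 2))  # ~ machine precision
```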

2.1 Spectral Projectors and Holomorphic Functional Calculus

Let \(M \in \mathbb {C}^{n\times n}\), with eigenvalues \(\lambda _1,\ldots ,\lambda _n\). We say that a matrix P is a spectral projector for M if \(MP = PM\) and \(P^2 = P\). For instance, each of the terms \(v_i w_i^*\) appearing in the spectral expansion (10) is a spectral projector, as \(Mv_iw_i^*= \lambda _i v_i w_i^*= v_i w_i^*M\) and \((v_i w_i^*)^2 = v_i w_i^*\) since \(w_i^*v_i = 1\). If \(\Gamma _i\) is a simple closed positively oriented rectifiable curve in the complex plane separating \(\lambda _i\) from the rest of the spectrum, then it is well known that

$$\begin{aligned} v_i w_i^*= \frac{1}{2\pi i}\oint _{\Gamma _i} (z - M)^{-1}d z, \end{aligned}$$

by taking the Jordan normal form of the resolvent \((z - M)^{-1}\) and applying Cauchy’s integral formula (Footnote 11).

Since every spectral projector P commutes with M, its range agrees exactly with an invariant subspace of M. We will often find it useful to choose some region of the complex plane bounded by a simple closed positively oriented rectifiable curve \(\Gamma \), and compute the spectral projector onto the invariant subspace spanned by those eigenvectors whose eigenvalues lie inside \(\Gamma \). Such a projector can be computed by a contour integral analogous to the above.

Recall that if f is any function, and M is diagonalizable, then we can meaningfully define \(f(M) := V f(D) V^{-1}\), where f(D) is simply the result of applying f to each element of the diagonal matrix D. The holomorphic functional calculus gives an equivalent definition that extends to the case when M is non-diagonalizable. As we will see, it has the added benefit that bounds on the norm of the resolvent of M can be converted into bounds on the norm of f(M).

Proposition 2.1

(Holomorphic Functional Calculus) Let M be any matrix, \(B \supset \Lambda (M)\) be an open neighborhood of its spectrum (not necessarily connected), and \(\Gamma _1,...,\Gamma _k\) be simple closed positively oriented rectifiable curves in B whose interiors together contain all of \(\Lambda (M)\). Then if f is holomorphic on B, the definition

$$\begin{aligned} f(M) := \frac{1}{2\pi i}\sum _{j=1}^k \oint _{\Gamma _j} f(z)(z - M)^{-1}d z \end{aligned}$$

is an algebra homomorphism in the sense that \((fg)(M) = f(M)g(M)\) for any f and g holomorphic on B.

Finally, we will frequently use the resolvent identity

$$\begin{aligned} (z - M)^{-1} - (z - M')^{-1} = (z-M)^{-1}(M - M')(z - M')^{-1} \end{aligned}$$

to analyze perturbations of contour integrals.
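These contour-integral formulas are also effective numerically. The sketch below (ours, assuming numpy; contour and test matrix chosen arbitrarily) approximates a spectral projector by applying the trapezoid rule, which converges exponentially for periodic analytic integrands, to a circle enclosing one eigenvalue.

```python
import numpy as np

def contour_projector(M, center, radius, m=200):
    """Approximate (1/(2 pi i)) * integral of (z - M)^{-1} dz over the
    circle |z - center| = radius via the m-point trapezoid rule. The
    result is the spectral projector onto the eigenvalues strictly
    inside, provided the circle stays well away from the spectrum."""
    n = M.shape[0]
    I = np.eye(n)
    P = np.zeros((n, n), dtype=complex)
    for k in range(m):
        z = center + radius * np.exp(2j * np.pi * k / m)
        dz = 2j * np.pi * (z - center) / m  # z'(theta) * (2 pi / m)
        P += np.linalg.solve(z * I - M, I) * dz
    return P / (2j * np.pi)

M = np.diag([0.0, 3.0, 4.0]) + 0.1 * np.random.default_rng(5).standard_normal((3, 3))
P = contour_projector(M, center=0.0, radius=1.0)
print(np.linalg.norm(P @ P - P, 2))  # idempotent, up to quadrature error
print(np.trace(P).real)              # ~ 1.0: one eigenvalue enclosed
```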

2.2 Pseudospectrum and Spectral Stability

The \(\epsilon \)-pseudospectrum of a matrix is defined in (5). Directly from this definition, we can relate the pseudospectra of a matrix and a perturbation of it.

Proposition 2.2

([62], Theorem 52.4) For any \(n \times n\) matrices M and E and any \(\epsilon > 0\), \(\Lambda _{\epsilon - \Vert E\Vert }(M) \subseteq \Lambda _\epsilon (M+E)\).

It is also immediate that \(\Lambda (M) \subset \Lambda _\epsilon (M)\), and in fact a stronger relationship holds as well:

Proposition 2.3

([62], Theorem 4.3) For any \(n \times n\) matrix M, any bounded connected component of \(\Lambda _\epsilon (M)\) must contain an eigenvalue of M.

Several other notions of stability will be useful to us as well. If M has distinct eigenvalues \(\lambda _1,\ldots ,\lambda _n\), and spectral expansion as in (10), we define the eigenvalue condition number of \(\lambda _i\) to be

$$\begin{aligned} \kappa (\lambda _i) := \left\| {v_i w_i^*}\right\| =\Vert v_i\Vert \Vert w_i\Vert . \end{aligned}$$

By considering the scaling of V in (2) in which its columns \(v_i\) have unit length, so that \(\kappa (\lambda _i) = \Vert w_i \Vert \), we obtain the useful relationship

$$\begin{aligned} \kappa _V(M)\le \Vert V \Vert \Vert V^{-1} \Vert \le \Vert V\Vert _F\Vert V^{-1}\Vert _F\le \sqrt{n\cdot \sum _{i\le n} \kappa (\lambda _i)^2}. \end{aligned}$$
(11)

Note also that the eigenvector condition number and pseudospectrum are related as follows:

Lemma 2.4

([62]) Let \(D(z,r)\) denote the open disk of radius r centered at \(z \in \mathbb {C}\). For every \(M \in \mathbb {C}^{n\times n}\),

$$\begin{aligned} \bigcup _i D(\lambda _i,\epsilon )\subset \Lambda _\epsilon (M)\subset \bigcup _i D(\lambda _i, \epsilon \kappa _V(M)). \end{aligned}$$
(12)

In this paper we will repeatedly use that assumptions about the pseudospectrum of a matrix can be turned into stability statements about functions applied to the matrix via the holomorphic functional calculus. Here we describe an instance of particular importance.

Let \(\lambda _i\) be a simple eigenvalue of M and let \(\Gamma \) be a contour in the complex plane, as in Sect. 2.1, separating \(\lambda _i\) from the rest of the spectrum of M, and assume \(\Lambda _\epsilon (M)\cap \Gamma =\emptyset \). Then, for any \(\Vert M-M'\Vert< \eta <\epsilon \), a combination of Proposition 2.2 and Proposition 2.3 implies that there is a unique eigenvalue \(\lambda _i'\) of \(M'\) in the region enclosed by \(\Gamma \), and furthermore \(\Lambda _{\epsilon -\eta }(M')\cap \Gamma = \emptyset \). If \(v_i'\) and \(w_i'\) are the right and left eigenvectors of \(M'\) corresponding to \(\lambda _i'\), we have

$$\begin{aligned} \Vert v_i'w_i'^*- v_iw_i^*\Vert&= \frac{1}{2\pi } \left\| \oint _\Gamma (z - M)^{-1} - (z-M')^{-1} d z \right\| \nonumber \\&= \frac{1}{2\pi }\left\| \oint _\Gamma (z - M)^{-1}(M - M')(z-M')^{-1} d z \right\| \nonumber \\&\le \frac{\ell (\Gamma )}{2\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$
(13)

We have introduced enough tools to prove Proposition 1.1.

Proof of Proposition 1.1

For \(t\in [0, 1]\) define \(A(t) = (1-t)A+ tA' \). Since \(\delta <\frac{\mathrm {gap}(A)}{8\kappa _V(A)}\) the Bauer–Fike theorem implies that A(t) has distinct eigenvalues for all t, and in fact \(\mathrm {gap}(A(t))\ge \frac{3\mathrm {gap}(A)}{4}\). Standard results in perturbation theory (for instance [34,  Theorem 1] or any of the references therein) imply that for every \(i=1, \dots , n\), A(t) has a unique eigenvalue \(\lambda _i(t)\) such that \(\lambda _i(t)\) is a differentiable trajectory, \(\lambda _i(0) =\lambda _i\) and \(\lambda _i(1)=\lambda _i'\). Let \(P_i(t)\) be the associated spectral projector of \(\lambda _i(t)\), which is uniquely defined via a contour integral, and write \(P_i = P_i(0)\).

Let \(\Gamma _i\) be the positively oriented contour forming the boundary of the closed disk centered at \(\lambda _i\) with radius \(\mathrm {gap}(A)/2\), and define \(\epsilon =\frac{\mathrm {gap}(A)}{2\kappa _V(A)}\). Lemma 2.4 implies \(\Lambda _{\epsilon }(A)\) is contained in the union of these disks over all \(i \in [n]\), and for fixed \(t\in [0, 1]\), since \(\Vert A-A(t)\Vert < t \delta \le \epsilon /4\), Proposition 2.2 gives the same containment for \(\Lambda _{3\epsilon /4}(A(t))\). Since these disks intersect only in their boundaries (if they do at all), \(\Vert (z - A)^{-1}\Vert \le 1/\epsilon \) and \(\Vert (z - A(t))^{-1}\Vert \le 4/(3\epsilon )\) for \(z \in \Gamma _i\). By the derivation of (13) above,

$$\begin{aligned} |\kappa (\lambda _i)-\kappa (\lambda _i(t))| \le \Vert P_i(t) - P_i\Vert \le \frac{\ell (\Gamma _i)}{2\pi } \cdot \frac{1}{\epsilon }\cdot \frac{4}{3\epsilon }\cdot \frac{\epsilon }{4} = \frac{\mathrm {gap}(A)}{2}\cdot \frac{2\kappa _V(A)}{3\,\mathrm {gap}(A)} = \frac{\kappa _V(A)}{3} \end{aligned}$$

and hence \(\kappa (\lambda _i(t))\le \kappa (\lambda _i)+\kappa _V(A)/3 \le 4\kappa _V(A)/3\). Combining this with (11) we obtain

$$\begin{aligned} \kappa _V(A(t)) \le 2 \sqrt{n \cdot \sum _i \kappa (\lambda _i)^2}< 4n \kappa _V(A)/3. \end{aligned}$$

From Theorem 2 of [34] and the subsequent discussion on p. 468, there exist analytic functions \(v_i(t)\) satisfying \(v_i(0) = v_i\) and \(A(t)v_i(t) = \lambda _i(t)v_i(t)\) for all \(i \in [n]\) and \(t \in [0,1]\), which furthermore admit the bound

$$\begin{aligned} \Vert \dot{v_i}(t)\Vert \le \frac{\kappa _V(A(t))}{\mathrm {gap}(A(t))}\Vert \dot{A}(t)\Vert \Vert v_i(t)\Vert \le \frac{\delta \kappa _V(A(t))}{\mathrm {gap}(A(t))}\Vert v_i(t)\Vert . \end{aligned}$$

However, these \(v_i(t)\) need not in general be unit vectors (see [34,  Section 3.4] and references for discussion of various normalizations). Therefore set \({\hat{v}}_i(t) = \Vert v_i(t)\Vert ^{-1} v_i(t)\), and note that by an application of the chain rule,

$$\begin{aligned} \Vert \dot{\hat{v_i}}(t)\Vert \le \frac{\delta \kappa _V(A(t))}{\mathrm {gap}(A(t))}. \end{aligned}$$

It then follows that the vectors \(v_i' = \hat{v_i}(1)\) for \(i \in [n]\) satisfy the conclusion of the theorem, by bounding \(\kappa _V(A(t))\le 4n\kappa _V(A)/3\) and \(\mathrm {gap}(A(t))\ge \frac{3\mathrm {gap}(A)}{4}\), and integrating the resulting upper bound \(\Vert \dot{\hat{v_i}}(t)\Vert \le \frac{16n\delta \kappa _V(A)}{9\mathrm {gap}(A)}\) from \(t = 0\) to \(t= 1\). \(\square \)

2.3 Finite-Precision Arithmetic

We briefly elaborate on the axioms for floating-point arithmetic given in Sect. 1.1. Similar guarantees to the ones appearing in that section for scalar-scalar operations also hold for operations such as matrix–matrix addition and matrix-scalar multiplication. In particular, if A is an \(n\times n\) complex matrix, then (writing \(\circ \) for the entrywise product)

$$\begin{aligned} \mathsf {fl}(A) = A + A \circ \Delta \qquad |\Delta _{i,j}| < {\textbf {u }}. \end{aligned}$$

It will be convenient for us to write such errors in additive, as opposed to multiplicative, form. We can convert the above to additive error as follows. Recall that for any \(n\times n\) matrix, the spectral norm (the \(\ell ^2 \rightarrow \ell ^2\) operator norm) is at most \(\sqrt{n}\) times the \(\ell ^1 \rightarrow \ell ^2\) operator norm, i.e. the maximal \(\ell ^2\) norm of a column. Thus, we have

$$\begin{aligned} \Vert A \circ \Delta \Vert \le \sqrt{n} \max _i \Vert (A\circ \Delta ) e_i\Vert \le \sqrt{n}\max _{i,j}|\Delta _{i,j}| \max _i \Vert A e_i\Vert \le {\textbf {u }}\sqrt{n}\Vert A\Vert . \end{aligned}$$
(14)

For more complicated operations such as matrix–matrix multiplication and matrix inversion, we use existing error guarantees from the literature. This is the subject of Sect. 2.5.

We will also need to compute the trace of a matrix \(A \in \mathbb {C}^{n\times n}\), and normalize a vector \(x \in \mathbb {C}^n\). Error analysis of these is standard (see for instance the discussion in [39,  Chapters 3–4]), and the results in this paper are highly insensitive to the details. For simplicity, calling \({\hat{x}} := x/\Vert x\Vert \), we will assume that

$$\begin{aligned} |\mathsf {fl}\left( \mathrm {Tr}A\right) - \mathrm {Tr}A|&\le n\Vert A\Vert {\textbf {u }} \end{aligned}$$
(15)
$$\begin{aligned} \Vert \mathsf {fl}({\hat{x}}) - {\hat{x}}\Vert&\le n{\textbf {u }}. \end{aligned}$$
(16)

Each of these can be achieved by assuming that \({\textbf {u }}n \le \epsilon \) for some suitably chosen \(\epsilon \), independent of n, a requirement which will shortly be superseded by several tighter assumptions on the machine precision.

Throughout the paper, we will take the pedagogical perspective that our algorithms are games played between the practitioner and an adversary who may additively corrupt each operation. In particular, we will include explicit error terms (always denoted by \(E_{(\cdot )}\)) in each appropriate step of every algorithm. In many cases we will first analyze a routine in exact arithmetic—in which case the error terms will all be set to zero—and subsequently determine the machine precision \({\textbf {u }}\) necessary so that the errors are small enough to guarantee convergence.

2.4 Sampling Gaussians in Finite Precision

For various parts of the algorithm, we will need to sample from normal distributions. For our model of arithmetic, we assume that the complex normal distribution can be sampled up to machine precision in O(1) arithmetic operations. To be precise, we assume the existence of the following sampler:

Definition 2.5

(Complex Gaussian Sampling)

A \(c_{\mathsf {N}}\)-stable Gaussian sampler \(\mathsf {N}(\sigma )\) takes as input \(\sigma \in \mathbb {R}_{\ge 0}\) and outputs a sample of a random variable \({\widetilde{G}} = \mathsf {N}(\sigma )\) with the property that there exists \(G \sim N_{\mathbb {C}}(0, \sigma ^2)\) satisfying

$$\begin{aligned} |{\widetilde{G}} - G| \le c_{\mathsf {N}}\sigma \cdot {\textbf {u }}\end{aligned}$$

with probability one, in at most \(T_\mathsf {N}\) arithmetic operations for some universal constant \(T_\mathsf {N}>0\).

Note that, since the Gaussian distribution has unbounded support, one should only expect the sampler \(\mathsf {N}(\sigma )\) to have a relative error guarantee of the sort \(|{\widetilde{G}} - G| \le c_{\mathsf {N}}\sigma |G| \cdot {\textbf {u }}\). However, as will become clear below, we only care about realizations of Gaussians satisfying \(|G|<R\), for a certain prespecified \(R>0\), and the rare event \(|G|>R\) will be accounted for in the failure probability of the algorithm. So, for the sake of exposition we decided to omit the |G| in the bound on \(|{\widetilde{G}}-G|\).

We will only sample \(O(n^2)\) Gaussians during the algorithm, so this sampling will not contribute significantly to the runtime. Here as everywhere in the paper, we will omit issues of underflow or overflow. Throughout this paper, to simplify some of our bounds, we will also assume that \(c_{\mathsf {N}}\ge 1\).
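In practice such a sampler is immediate: a complex Gaussian is a pair of independent real Gaussians, as in the sketch below (ours, assuming numpy; in the terms of Definition 2.5, the roundoff of the two real samples and of the scaling is what the \(c_{\mathsf {N}}\sigma {\textbf {u }}\) term absorbs).

```python
import numpy as np

rng = np.random.default_rng(6)

def complex_gaussian(sigma, size=()):
    """Sample N_C(0, sigma^2): independent real and imaginary parts,
    each a real Gaussian with variance sigma^2 / 2."""
    re = rng.standard_normal(size)
    im = rng.standard_normal(size)
    return sigma / np.sqrt(2) * (re + 1j * im)

n = 1000
G = complex_gaussian(1 / np.sqrt(n), size=(n, n))  # E|G_ij|^2 = 1/n
print(n * np.mean(np.abs(G) ** 2))  # ~ 1
print(np.linalg.norm(G, 2))         # ~ 2, the edge of the circular law
```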

2.5 Black-box Error Assumptions for Multiplication, Inversion, and QR

Our algorithm uses matrix–matrix multiplication, matrix inversion, and QR factorization as primitives. For our analysis, we must therefore assume some bounds on the error and runtime costs incurred by these subroutines. In this section, we first formally state the kind of error and runtime bounds we require, and then discuss some implementations known in the literature that satisfy each of our requirements with modest constants.

Our definitions are inspired by the definition of logarithmic stability introduced in [22]. Roughly speaking, they say that implementing the algorithm with floating point precision \({\textbf {u }}\) yields an accuracy which is at most polynomially or quasipolynomially in n worse than \({\textbf {u }}\) (possibly also depending on the condition number in the case of inversion). Their definition has the property that while a logarithmically stable algorithm is not strictly speaking backward stable, it can attain the same forward error bound as a backward stable algorithm at the cost of increasing the bit length by a polylogarithmic factor. See Section 3 of their paper for a precise definition and a more detailed discussion of how their definition relates to standard numerical stability notions.

Definition 2.6

A \(\mu _{\mathsf {MM}}(n)\)-stable multiplication algorithm \(\mathsf {MM}(\cdot , \cdot )\) takes as input \(A,B\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}>0\) and outputs \(C=\mathsf {MM}(A, B)\) satisfying

$$\begin{aligned} \Vert C-AB\Vert \le \mu _{\mathsf {MM}}(n) \cdot {\textbf {u }}\Vert A\Vert \Vert B\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\), in \(T_\mathsf {MM}(n)\) arithmetic operations.

Definition 2.7

A \((\mu _{\mathsf {INV}}(n), c_\mathsf {INV})\)-stable inversion algorithm \(\mathsf {INV}(\cdot )\) takes as input \(A\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}\) and outputs \(C=\mathsf {INV}(A)\) satisfying

$$\begin{aligned} \Vert C-A^{-1}\Vert \le \mu _{\mathsf {INV}}(n)\cdot {\textbf {u }}\cdot \kappa (A)^{c_\mathsf {INV}\log n}\Vert A^{-1}\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\), in \(T_\mathsf {INV}(n)\) arithmetic operations.

Definition 2.8

A \(\mu _\mathsf {QR}(n)\)-stable QR factorization algorithm \(\mathsf {QR}(\cdot )\) takes as input \(A\in \mathbb {C}^{n\times n}\) and a precision \({\textbf {u }}\), and outputs \([Q,R]=\mathsf {QR}(A)\) such that

  1.

    R is exactly upper triangular.

  2.

    There is a unitary \(Q'\) and a matrix \(A'\) such that

    $$\begin{aligned} A' = Q'R, \end{aligned}$$
    (17)

    and

    $$\begin{aligned} \Vert Q' - Q\Vert \le \mu _\mathsf {QR}(n){\textbf {u }}, \quad \text {and} \quad \Vert A'-A\Vert \le \mu _\mathsf {QR}(n){\textbf {u }}\Vert A\Vert , \end{aligned}$$

on a floating point machine with precision \({\textbf {u }}\). Its running time is \(T_\mathsf {QR}(n)\) arithmetic operations.

Remark 2.9

Throughout this paper, to simplify some of our bounds, we will assume that

$$\begin{aligned} 1 \le \mu _\mathsf {MM}(n), \mu _{\mathsf {INV}}(n) , \mu _\mathsf {QR}(n), c_{\mathsf {INV}}\log n. \end{aligned}$$

The above definitions can be instantiated with traditional \(O(n^3)\)-complexity algorithms for which \(\mu _{\mathsf {MM}}, \mu _\mathsf {QR}, \mu _{\mathsf {INV}}\) are all O(n) and \(c_\mathsf {INV}=1\) [39]. This yields easily implementable practical algorithms with running times depending cubically on n.

In order to achieve \(O(n^\omega )\)-type efficiency, we instantiate them with fast-matrix-multiplication-based algorithms and with \(\mu (n)\) taken to be a low-degree polynomial [22]. Specifically, the following parameters are known to be achievable.

Theorem 2.10

(Fast and Stable Instantiations of \(\mathsf {MM},\mathsf {INV}, \mathsf {QR}\))

  1.

    If \(\omega \) is the exponent of matrix multiplication, then for every \(\eta >0\) there is a \(\mu _{\mathsf {MM}}(n)\)-stable multiplication algorithm with \(\mu _{\mathsf {MM}}(n)=n^{c_\eta }\) and \(T_\mathsf {MM}(n)=O(n^{\omega +\eta })\), where \(c_\eta \) does not depend on n.

  2.

    Given an algorithm for matrix multiplication satisfying (1), there is a \((\mu _{\mathsf {INV}}(n),c_\mathsf {INV})\)-stable inversion algorithm with

    $$\begin{aligned} \mu _{\mathsf {INV}}(n)\le O(\mu _{\mathsf {MM}}(n)n^{\lg (10)}),\quad \quad c_\mathsf {INV}\le 8, \end{aligned}$$

    and \(T_\mathsf {INV}(n)\le T_\mathsf {MM}(3n)=O(T_\mathsf {MM}(n))\).

  3.

    Given an algorithm for matrix multiplication satisfying (1), there is a \(\mu _\mathsf {QR}(n)\)-stable QR factorization algorithm with

    $$\begin{aligned} \mu _\mathsf {QR}(n)=O(n^{c_\mathsf {QR}} \mu _{\mathsf {MM}}(n)), \end{aligned}$$

    where \(c_\mathsf {QR}\) is an absolute constant, and \(T_\mathsf {QR}(n)=O(T_\mathsf {MM}(n))\).

In particular, all of the running times above are bounded by \(T_\mathsf {MM}(n)\) for an \(n\times n\) matrix.

Proof

(1) is Theorem 3.3 of [23]. (2) is Theorem 3.3 (see also equation (9) above its statement) of [22]. (3) appears in Section 4.1 of [22]. The final claim follows by noting that \(T_\mathsf {MM}(3n)=O(T_\mathsf {MM}(n))\), which one sees by dividing a \(3n\times 3n\) matrix into nine \(n\times n\) blocks and proceeding blockwise, at the cost of a factor of 9 in \(\mu _{\mathsf {INV}}(n)\). \(\square \)

We remark that for specific existing fast matrix multiplication algorithms such as Strassen’s algorithm, specific small values of \(\mu _\mathsf {MM}(n)\) are known (see [23] and its references for details), so these may also be used as a black box, though we will not do this in this paper.

3 Pseudospectral Shattering

This section is devoted to our central probabilistic result, Theorem 1.4, and the accompanying notion of pseudospectral shattering which will be used extensively in our analysis of the spectral bisection algorithm in Sect. 5.

3.1 Smoothed Analysis of Gap and Eigenvector Condition Number

As is customary in the literature, we will refer to an \(n\times n\) random matrix \(G_n\) whose entries are independent complex Gaussians drawn from \(\mathcal {N}(0,1_\mathbb {C}/n)\) as a normalized complex Ginibre random matrix. To be absolutely clear, and because other choices of scaling are quite common, we mean that \(\mathbb {E}G_{i,j} = 0\) and \(\mathbb {E}|G_{i,j}|^2 = 1/n\).
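For illustration, a normalized complex Ginibre matrix under this convention can be sampled as follows (a sketch in numpy; not part of the paper's algorithms):

```python
import numpy as np

def ginibre(n, rng=None):
    """Sample a normalized complex Ginibre matrix: E[G_ij] = 0, E[|G_ij|^2] = 1/n."""
    rng = np.random.default_rng() if rng is None else rng
    # real and imaginary parts each have variance 1/(2n), so |G_ij|^2 has mean 1/n
    return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0 * n)
```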

In the course of proving Theorem 1.4, we will need to bound the probability that the second-smallest singular value of an arbitrary matrix with small Ginibre perturbation is atypically small. We begin with a well-known lower tail bound on the singular values of a Ginibre matrix alone.

Theorem 3.1

([61,  Theorem 1.2]) For \(G_n\) an \(n\times n\) normalized complex Ginibre matrix and for any \(\alpha \ge 0\) it holds that

$$\begin{aligned} \mathbb {P}\left[ \sigma _j(G_n) < \frac{\alpha (n-j+1)}{n} \right] \le \left( \sqrt{2e} \, \alpha \right) ^{2(n-j+1)^2}. \end{aligned}$$

As in earlier work of several of the authors [9], we can transfer this result to the case of a Ginibre perturbation via a remarkable coupling result of P. Śniady.

Theorem 3.2

(Śniady [58]) Let \(A_1\) and \(A_2\) be \(n \times n\) complex matrices such that \(\sigma _i(A_1) \le \sigma _i(A_2)\) for all \(1 \le i \le n\). Assume further that \(\sigma _i(A_1) \ne \sigma _j(A_1)\) and \(\sigma _i(A_2) \ne \sigma _j(A_2)\) for all \(i \ne j\). Then for every \(t \ge 0\), there exists a joint distribution on pairs of \(n \times n\) complex matrices \((G_1, G_2)\) such that

  1.

    the marginals \(G_1\) and \(G_2\) are distributed as normalized complex Ginibre matrices, and

  2.

    almost surely \(\sigma _i(A_1 + \sqrt{t} G_1) \le \sigma _i(A_2 + \sqrt{t} G_2)\) for every i.

Corollary 3.3

For any fixed matrix M and parameters \(\gamma , t>0\)

$$\begin{aligned} \mathbb {P}[\sigma _{n-1}(M+\gamma G_n)<t]\le (e/2)^4 (tn/\gamma )^8 \le 4(tn/\gamma )^8. \end{aligned}$$

Proof

We would like to apply Theorem 3.2 to \(A_1=0\) and \(A_2=M\), but the theorem has the technical condition that \(A_1\) and \(A_2\) have distinct singular values. Taking vanishingly small perturbations of 0 and M satisfying this condition and taking the size of the perturbation to zero, we obtain

$$\begin{aligned} \mathbb {P}[\sigma _{n-1}(M+\gamma G_n)<t] \le \mathbb {P}[\sigma _{n-1}(\gamma G_n)<t] = \mathbb {P}[\sigma _{n-1}(G_n)<t/\gamma ]. \end{aligned}$$

Invoking Theorem 3.1 with \(j=n-1\) and \(\alpha \) replaced by \(tn/2\gamma \) yields the claim.

\(\square \)
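As a sanity check, the bound of Corollary 3.3 can be compared against simulation. The following sketch (with arbitrary choices of M and of the parameters, ours not the paper's) estimates the left-hand side by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, t, trials = 50, 0.1, 1e-3, 2000
M = np.diag(np.ones(n - 1), k=1)          # an arbitrary, badly non-normal choice of M
hits = 0
for _ in range(trials):
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
    s = np.linalg.svd(M + gamma * G, compute_uv=False)
    hits += s[-2] < t                     # sigma_{n-1}: second-smallest singular value
print("empirical:", hits / trials, "  bound:", 4 * (t * n / gamma) ** 8)
```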

We will need as well the main theorem of [9], which shows that the addition of a small complex Ginibre to an arbitrary matrix tames its eigenvalue condition numbers.

Theorem 3.4

([9,  Theorem 1.5]) Suppose \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and \(\delta \in (0,1)\). Let \(G_n\) be a complex Ginibre matrix, and let \(\lambda _1,\ldots ,\lambda _n\in \mathbb {C}\) be the (random) eigenvalues of \(A+\delta G_n\). Then for every measurable open set \(B\subset \mathbb {C},\)

$$\begin{aligned} \mathbb {E}\sum _{\lambda _i \in B} \kappa (\lambda _i)^2 \le \frac{n^2}{\pi \delta ^2}\mathrm {vol}(B). \end{aligned}$$

Our final lemma before embarking on the proof in earnest shows that bounds on the j-th smallest singular value and eigenvector condition number are sufficient to rule out the presence of j eigenvalues in a small region. For our particular application, we will take \(j=2\).

Lemma 3.5

Let \(D(z_0,r) :=\{z \in \mathbb {C} :|z-z_0|<r\}\). If \(M\in \mathbb {C}^{n\times n}\) is a diagonalizable matrix with at least j eigenvalues in \(D(z_0,r)\) then

$$\begin{aligned} \sigma _{n-j+1}(z_0-M) \le r\kappa _V(M). \end{aligned}$$

Proof

Write \(M=VDV^{-1}\). By Courant–Fischer:

$$\begin{aligned} \sigma _{n-j+1}(z_0-M)&= \min _{S:\dim (S)=j}\max _{x\in S\setminus \{0\}} \frac{\Vert V(z_0-D)V^{-1}x\Vert }{\Vert x\Vert }&\\&=\min _{S:\dim (S)=j}\max _{y\in V(S)\setminus \{0\}} \frac{\Vert V(z_0-D)y\Vert }{\Vert Vy\Vert }&\text { setting }y=Vx\\&=\min _{S:\dim (S)=j}\max _{y\in S\setminus \{0\}} \frac{\Vert V(z_0-D)y\Vert }{\Vert Vy\Vert }&\text { since }V \text { is invertible}\\&\le \min _{S:\dim (S)=j}\max _{y\in S\setminus \{0\}} \frac{\Vert V\Vert \Vert (z_0-D)y\Vert }{\sigma _n(V)\Vert y\Vert }&\\&\le \kappa _V(M)\sigma _{n-j+1}(z_0-D).&\end{aligned}$$

Since \(z_0-D\) is diagonal its singular values are just \(|z_0-\lambda _i|\), so the j-th smallest is at most r, finishing the proof. \(\square \)

We now present the main tail bound that we use to control the minimum gap and eigenvector condition number.

Theorem 3.6

(Multiparameter Tail Bound) Let \(A\in \mathbb {C}^{n\times n}\). Assume \(\Vert A\Vert \le 1\) and \(\gamma <1/2\), and let \(X:=A+\gamma G_n\) where \(G_n\) is a complex Ginibre matrix. For every \(t,r>0\):

$$\begin{aligned}&\mathbb {P}\big [\kappa _V(X)<t,\, \mathrm {gap}(X)>r, \, \Vert G_n\Vert <4\big ]\ge 1\nonumber \\&\quad -\left( \frac{144}{r^2}\cdot 4(trn/\gamma )^8 + (9n^3/\gamma ^2t^2)+2e^{-2 n}\right) . \end{aligned}$$
(18)

Proof

Write \(\Lambda (X):=\{\lambda _1,\ldots ,\lambda _n\}\) for the (random) eigenvalues of \(X:=A+\gamma G_n\), in increasing order of magnitude (there are no ties almost surely). Let \(\mathcal {N}\subset \mathbb {C}\) be a minimal r/2-net of \(B:=D(0,3)\), recalling the standard fact that one exists of size no more than \((3\cdot 4/r)^2=144/r^2\). The most useful feature of such a net is that, by the triangle inequality, for any \(a,b \in D(0,3)\) with distance at most r, there is a point \(y\in \mathcal {N}\) with \(|y-(a+b)/2|<r/2\) satisfying \(a,b\in D(y,r)\). In particular, if \(\mathrm {gap}(X) < r\), then there are two eigenvalues in the disk of radius r centered at some point \(y \in \mathcal {N}\).

Therefore, consider the events

$$\begin{aligned} E_\mathrm {gap}&:= \{\mathrm {gap}(X)<r\} \subset \{\exists y\in \mathcal {N}: |D(y,r)\cap \Lambda (X)|\ge 2\} \\ E_D&:= \{\Lambda (X)\not \subseteq D(0,3)\}\subset \{\Vert G_n\Vert \ge 4\}:=E_G \\ E_\kappa&:= \{\kappa _V(X) > t\} \\ E_y&:=\{\sigma _{n-1}(y-X) < rt\},\qquad y\in \mathcal {N}. \end{aligned}$$

Lemma 3.5 applied to each \(y\in \mathcal {N}\) with \(j=2\) reveals that

$$\begin{aligned} E_\mathrm {gap}\subseteq E_D\cup E_\kappa \cup \bigcup _{y\in \mathcal {N}} E_y, \end{aligned}$$

whence

$$\begin{aligned} E_\mathrm {gap}\cup E_\kappa \subseteq E_D\cup E_\kappa \cup \bigcup _{y\in \mathcal {N}} E_y. \end{aligned}$$

By a union bound, we have

$$\begin{aligned} \mathbb {P}[E_\mathrm {gap}\cup E_\kappa ] \le \mathbb {P}[E_D\cup E_\kappa ]+|\mathcal {N}|\max _{y\in \mathcal {N}} \mathbb {P}[E_y]. \end{aligned}$$
(19)

From the tail bound on the operator norm of a Ginibre matrix in [9,  Lemma 2.2],

$$\begin{aligned} \mathbb {P}[E_D] \le \mathbb {P}[E_G]\le 2e^{-(4-2\sqrt{2})^2n}\le 2e^{-2 n}. \end{aligned}$$
(20)

Observe that by (11),

$$\begin{aligned} \left\{ \kappa _V(X) > \sqrt{n\sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2}\right\} \subset E_D, \end{aligned}$$

since the inequality in the left-hand event must reverse when we sum over all \(\lambda _i \in \Lambda (X)\); thus,

$$\begin{aligned} E_\kappa \subset E_D \cup \left\{ \sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 > t^2/n\right\} . \end{aligned}$$

Theorem 3.4 and Markov’s inequality yield

$$\begin{aligned} \mathbb {P}\left[ \sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 > t^2/n\right] \le \mathbb {E}\sum _{\lambda _i\in D(0,3)}\kappa (\lambda _i)^2 \frac{n}{t^2} \le \frac{9\pi n^2}{\pi \gamma ^2} \frac{n}{t^2} = \frac{9n^3}{t^2\gamma ^2}. \end{aligned}$$

Thus, we have

$$\begin{aligned} \mathbb {P}[E_\kappa \cup E_D]\le \frac{9n^3}{t^2\gamma ^2} + 2e^{-2 n}. \end{aligned}$$

Corollary 3.3 applied to \(M=-y+A\) gives the bound

$$\begin{aligned} \mathbb {P}[E_y]\le 4\left( \frac{trn}{\gamma }\right) ^8, \end{aligned}$$

for each \(y\in \mathcal {N}\), and plugging these estimates back into (19) we have

$$\begin{aligned} \mathbb {P}[E_\mathrm {gap}\cup E_\kappa \cup E_D] \le \mathbb {P}[E_\mathrm {gap}\cup E_\kappa \cup E_G]\le \frac{144}{r^2}\cdot 4\left( \frac{trn}{\gamma }\right) ^8 + \frac{9n^3}{\gamma ^2t^2}+2e^{-2 n}, \end{aligned}$$

as desired. \(\square \)

A specific setting of parameters in Theorem 3.6 immediately yields Theorem 1.4.

Proof of Theorem 1.4

Applying Theorem 3.6 with parameters \( t:=\frac{n^2}{\gamma }\) and \(r := \frac{\gamma ^4}{n^5}\), we have

$$\begin{aligned} \mathbb {P}\big [\mathrm {gap}(X)>r,\, \kappa _V(X)<t,\, \Vert G\Vert \le 4\big ]\ge & {} 1-600 \frac{n^{10}}{\gamma ^8}\left( \frac{\gamma ^2}{n^2}\right) ^8 \nonumber \\&- \frac{9}{n}-2e^{-2 n}\ge 1-12/n, \end{aligned}$$
(21)

as desired, where in the last step we use the assumption \(\gamma < 1/2\). \(\square \)
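The regularization phenomenon behind Theorem 1.4 is easy to observe numerically. The sketch below (illustrative; the Jordan block and the parameters are our ad hoc choices) computes the minimum gap of a perturbed Jordan block together with an upper bound on its eigenvector condition number:

```python
import numpy as np

def min_gap(eigs):
    d = np.abs(eigs[:, None] - eigs[None, :])
    return d[~np.eye(len(eigs), dtype=bool)].min()

def kappa_V_upper(A):
    # cond(V) for the eigenvector matrix returned by numpy; kappa_V(A) is the
    # infimum over all diagonalizing V, so this is only an upper bound on it
    _, V = np.linalg.eig(A)
    return np.linalg.cond(V)

rng = np.random.default_rng(1)
n, gamma = 40, 1e-4
A = np.diag(np.ones(n - 1), k=1)          # Jordan block: not diagonalizable
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G
print("gap(X) =", min_gap(np.linalg.eigvals(X)), "  kappa_V(X) <=", kappa_V_upper(X))
```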

Since it is of independent interest in random matrix theory, we record the best bound on the gap alone that is possible to extract from the theorem above.

Corollary 3.7

(Minimum Gap Bound)

For X as in Theorem 3.6,

$$\begin{aligned} \mathbb {P}[\mathrm {gap}(X)<r]\le 2\cdot 9^{4/5}(144\cdot 4)^{1/5}(n/\gamma )^{2+6/5}r^{6/5} + 2e^{-2 n}\le 42(n/\gamma )^{3.2}r^{1.2}+ 2e^{-2 n}. \end{aligned}$$

In particular, the probability is o(1) if \(r=o((\gamma /n)^{8/3})\).

Proof

Setting

$$\begin{aligned} t^{10} = \frac{9}{144\cdot 4}(\gamma /nr)^6 \end{aligned}$$

in Theorem 3.6 balances the first two terms and yields the advertised bound. \(\square \)

3.2 Shattering

Propositions 2.2 and 2.3 in the preliminaries together tell us that if the \(\epsilon \)-pseudospectrum of an \(n\times n\) matrix A has n connected components, then each eigenvalue of any size-\(\epsilon \) perturbation \({\widetilde{A}}\) will lie in its own connected component of \(\Lambda _\epsilon (A)\). The following key definitions make this phenomenon quantitative in a sense which is useful for our analysis of spectral bisection.

Definition 3.8

(Grid) A grid in the complex plane consists of the boundaries of a lattice of squares with lower edges parallel to the real axis. We will write

$$\begin{aligned} \mathsf {grid}(z_0,\omega ,s_1,s_2) \subset \mathbb {C}\end{aligned}$$

to denote an \(s_1\times s_2\) grid of \(\omega \times \omega \) squares with lower left corner at \(z_0 \in \mathbb {C}\). Write \({{\,\mathrm{diam}\,}}(\mathsf {g}) := \omega \sqrt{s_1^2 + s_2^2}\) for the diameter of the grid.

Definition 3.9

(Shattering) A pseudospectrum \(\Lambda _\epsilon (A)\) is shattered with respect to a grid \(\mathsf {g}\) if:

  1.

    Every square of \(\mathsf {g}\) has at most one eigenvalue of A.

  2.

    \(\Lambda _\epsilon (A)\cap \mathsf {g}=\emptyset \).

Observation 3.10

As \(\Lambda _\epsilon (A)\) contains a ball of radius \(\epsilon \) about each eigenvalue of A, shattering of the \(\epsilon \)-pseudospectrum with respect to a grid with side length \(\omega \) implies \(\epsilon \le \omega /2\).

As a warm-up for more sophisticated arguments later on, we give here an easy consequence of the shattering property.

Lemma 3.11

If \(\lambda _1, \dots , \lambda _n\) are the eigenvalues of A, and \(\Lambda _\epsilon (A)\) is shattered with respect to a grid \(\mathsf {g}\) with side length \(\omega \), then every eigenvalue condition number satisfies \(\kappa (\lambda _i) \le \frac{2\omega }{\pi \epsilon }\).

Proof

Let \(v,w^*\) be a right/left eigenvector pair for some eigenvalue \(\lambda _i\) of A, normalized so that \(w^*v = 1\). Letting \(\Gamma \) be the positively oriented boundary of the square of \(\mathsf {g}\) containing \(\lambda _i\), we can extract the projector \(vw^*\) by integrating, and pass norms inside the contour integral to obtain

$$\begin{aligned} \kappa (\lambda _i)&= \Vert vw^*\Vert = \left\| \frac{1}{2\pi i}\oint _\Gamma (z - A)^{-1}d z \right\| \le \frac{1}{2\pi }\oint _\Gamma \left\| (z - A)^{-1}\right\| d z \le \frac{2\omega }{\pi \epsilon }.\nonumber \\ \end{aligned}$$
(22)

In the final step, we have used the fact that, given the definition of pseudospectrum (6) above, \(\Lambda _\epsilon (A) \cap \mathsf {g}= \emptyset \) means \(\Vert (z - A)^{-1}\Vert \le 1/\epsilon \) on \(\mathsf {g}\). \(\square \)
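The quantities \(\kappa (\lambda _i)\) appearing in this lemma can be computed directly from an eigendecomposition, as in the following sketch (illustrative, assuming a diagonalizable input):

```python
import numpy as np

def eig_cond_numbers(A):
    """kappa(lambda_i) = ||v_i w_i^*|| = ||v_i|| * ||w_i|| with w_i^* v_i = 1."""
    lam, V = np.linalg.eig(A)
    W = np.linalg.inv(V).conj().T      # columns of W are the left eigenvectors w_i
    # rows of inv(V) are w_i^* with w_i^* v_j = delta_ij, so the normalization
    # w_i^* v_i = 1 holds automatically
    return lam, np.linalg.norm(V, axis=0) * np.linalg.norm(W, axis=0)
```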

The theorem below quantifies the extent to which perturbing by a Ginibre matrix results in a shattered pseudospectrum. See Fig. 1 for an illustration in the case where the initial matrix is poorly conditioned. In general, not all eigenvalues need move so far upon such a perturbation, in particular if the respective \(\kappa _i\) are small.

Fig. 1

T is a sample of an upper triangular \(10\times 10\) Toeplitz matrix with zeros on the diagonal and an independent standard real Gaussian repeated along each diagonal above the main diagonal. G is a sample of a \(10\times 10\) complex Ginibre matrix with unit variance entries. Using the MATLAB package EigTool [67], the boundaries of the \(\epsilon \)-pseudospectrum of T (left) and \(T+10^{-6} G\) (right) for \(\epsilon = 10^{-6}\) are plotted along with the spectra. The latter pseudospectrum is shattered with respect to the pictured grid
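EigTool is a MATLAB package; for readers without it, the Fig. 1 experiment can be roughly re-created in numpy by evaluating \(\sigma _{\min }(z-X)\) on a mesh, the standard (if slow) way to draw pseudospectra. The following sketch is ours and only approximates the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, gamma = 10, 1e-6, 1e-6
c = rng.standard_normal(n)                      # one Gaussian per superdiagonal
T = sum(c[k] * np.eye(n, k=k) for k in range(1, n))
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
X = T + gamma * G                               # unit-variance Ginibre, as in the caption

xs = ys = np.linspace(-2, 2, 120)
smin = np.array([[np.linalg.svd((x + 1j * y) * np.eye(n) - X,
                                compute_uv=False)[-1] for x in xs] for y in ys])
# mesh points with smin < eps approximate Lambda_eps(X); the level set at eps
# traces the pseudospectral boundaries plotted in Fig. 1
```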

Theorem 3.12

(Exact Arithmetic Shattering) Let \(A\in {\mathbb {C}}^{n\times n}\) and \(X:=A+\gamma G_n\) for \(G_n\) a complex Ginibre matrix. Assume \(\Vert A\Vert \le 1\) and \(0< \gamma < 1/2\). Let \(\mathsf {g}:= \mathsf {grid}(z, \omega ,\lceil 8/\omega \rceil , \lceil 8/\omega \rceil )\) with \(\omega := \frac{\gamma ^4}{4n^5}\), and z chosen uniformly at random from the square of side \(\omega \) cornered at \(-4-4i\). Then, \(\kappa _V(X)\le n^2/\gamma \), \(\Vert A-X\Vert \le 4\gamma \), and \(\Lambda _\epsilon (X)\) is shattered with respect to \(\mathsf {g}\) for

$$\begin{aligned} \epsilon := \frac{\gamma ^5}{16n^9}, \end{aligned}$$

with probability at least \(1-13/n\).

Proof

Condition on the event in Theorem 1.4, so that

$$\begin{aligned} \kappa _V(X)\le \frac{n^2}{\gamma },\quad \Vert X-A\Vert \le 4\gamma ,\quad \text { and } \mathrm {gap}(X)\ge \frac{\gamma ^4}{n^5}=4\omega . \end{aligned}$$

Consider the random grid \(\mathsf {g}\). Since D(0, 3) is contained in the square of side length 8 centered at the origin, every eigenvalue of X is contained in one square of \(\mathsf {g}\) with probability 1. Moreover, since \(\mathrm {gap}(X)>4\omega \), no square can contain two eigenvalues. Let

$$\begin{aligned} \mathsf {dist}_\mathsf {g}(z):=\min _{y\in \mathsf {g}}|z-y|. \end{aligned}$$

Let \(\lambda _i := \lambda _i(X)\). We now have for each \(\lambda _i\) and every \(s < \frac{\omega }{2}\):

$$\begin{aligned} \mathbb {P}[\mathsf {dist}_\mathsf {g}(\lambda _i)>s] = \frac{(\omega -2s)^2}{\omega ^2} = 1- \frac{4 s}{\omega }+ \frac{4s^2 }{\omega ^2} \ge 1-\frac{4s}{\omega }, \end{aligned}$$

since the distribution of \(\lambda _i\) inside its square is uniform with respect to Lebesgue measure. Setting \(s=\omega /4n^2\), this probability is at least \(1-1/n^2\), so by a union bound

$$\begin{aligned} \mathbb {P}[\min _{i\le n} \mathsf {dist}_\mathsf {g}(\lambda _i)>\omega /4n^2]>1-1/n, \end{aligned}$$
(23)

i.e., every eigenvalue is well-separated from \(\mathsf {g}\) with probability \(1-1/n\).

We now recall from (12) that

$$\begin{aligned} \Lambda _\epsilon (X)\subset \bigcup _{i\le n} D(\lambda _i, \kappa _V(X)\epsilon ). \end{aligned}$$

Thus, on the events (21) and (23), we see that \(\Lambda _\epsilon (X)\) is shattered with respect to \(\mathsf {g}\) as long as

$$\begin{aligned} \kappa _V(X)\epsilon < \frac{\omega }{4n^2}, \end{aligned}$$

which is implied by

$$\begin{aligned} \epsilon < \frac{\gamma ^4}{4n^5}\cdot \frac{1}{4n^2} \cdot \frac{\gamma }{n^2} =\frac{\gamma ^5}{16n^9}. \end{aligned}$$

Thus, the advertised claim holds with probability at least

$$\begin{aligned} 1-\frac{1}{n}- \frac{12}{n} = 1 - \frac{13}{n} , \end{aligned}$$

as desired. \(\square \)

Finally, we show that the shattering property is retained when the Gaussian perturbation is added in finite precision rather than exactly. This also serves as a pedagogical warm-up for our presentation of more complicated algorithms later in the paper: we use E to represent an adversarial roundoff error (as in step 2), and for simplicity neglect roundoff error completely in computations whose size does not grow with n (such as steps 3 and 4, which set scalar parameters).

[Pseudocode for the algorithm \(\mathsf {SHATTER}\); not reproduced here. An informal sketch follows.]
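Since the original pseudocode is not reproduced above, here is an informal exact-arithmetic sketch of \(\mathsf {SHATTER}\) assembled from the parameter choices in Theorems 3.12 and 3.13 (step numbering and the roundoff model are omitted; this is our reconstruction, not the paper's verbatim pseudocode):

```python
import numpy as np

def shatter(A, gamma, rng=None):
    """Exact-arithmetic sketch of SHATTER per Theorem 3.12: returns (X, grid, eps)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
    X = A + gamma * G                          # in finite arithmetic, step 2 incurs E
    omega = gamma ** 4 / (4 * n ** 5)          # grid side length
    z0 = (-4 - 4j) + omega * (rng.random() + 1j * rng.random())
    s = int(np.ceil(8 / omega))                # an s x s grid covering D(0, 3)
    eps = gamma ** 5 / (16 * n ** 9)
    return X, (z0, omega, s, s), eps
```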

Theorem 3.13

(Finite Arithmetic Shattering) Assume there is a \(c_{\mathsf {N}}\)-stable Gaussian sampling algorithm \(\mathsf {N}\) satisfying the requirements of Definition 2.5. Then \(\mathsf {SHATTER}\) has the advertised guarantees as long as the machine precision satisfies

$$\begin{aligned} {\textbf {u }}\le \frac{1}{2}\frac{\gamma ^5}{16n^9}\cdot \frac{1}{(3+c_{\mathsf {N}})\sqrt{n}}, \end{aligned}$$
(24)

and runs in

$$\begin{aligned} n^2T_\mathsf {N}+ n^2=O(n^2) \end{aligned}$$

arithmetic operations.

Proof

The two sources of error in \(\mathsf {SHATTER}\) are:

  1.

    An additive error of operator norm at most \(n\cdot c_{\mathsf {N}}\cdot (1/\sqrt{n})\cdot {\textbf {u }}\le c_{\mathsf {N}}\sqrt{n}\cdot {\textbf {u }}\) from \(\mathsf {N}\), by Definition 2.5.

  2.

    An additive error of norm at most \(\sqrt{n}\cdot \Vert X\Vert \cdot {\textbf {u }}\le 3\sqrt{n}{\textbf {u }}\), with probability at least \(1-1/n\), from the roundoff E in step 2.

Thus, as long as the precision satisfies (24), we have

$$\begin{aligned} \Vert \mathsf {SHATTER}(A,\gamma )-\mathrm {shatter}(A,\gamma )\Vert \le \frac{1}{2}\frac{\gamma ^5}{16n^9}, \end{aligned}$$

where \(\mathrm {shatter}(\cdot )\) refers to the (exact arithmetic) outcome of Theorem 3.12. The correctness of \(\mathsf {SHATTER}\) now follows from Proposition 2.2. Its running time is bounded by

$$\begin{aligned} n^2T_\mathsf {N}+ n^2 \end{aligned}$$

arithmetic operations, as advertised. \(\square \)

4 Matrix Sign Function

The algorithmic centerpiece of this work is the analysis, in finite arithmetic, of a well-known iterative method for approximating the matrix sign function. Recall from Sect. 1 that if A is a matrix whose spectrum avoids the imaginary axis, then

$$\begin{aligned} \mathrm {sgn}(A) = P_+ - P_- \end{aligned}$$

where \(P_+\) and \(P_-\) are the spectral projectors corresponding to eigenvalues in the open right and left half-planes, respectively. The iterative algorithm we consider approximates the matrix sign function by repeated application to A of the function

$$\begin{aligned} g(z) := \frac{1}{2}(z + z^{-1}). \end{aligned}$$
(25)

This is simply Newton’s method to find a root of \(z^2 - 1\), but one can verify that the function g fixes the left and right half-planes, and thus we should expect it to push those eigenvalues in the former towards \(-1\), and those in the latter towards \(+1\).

We denote the specific finite-arithmetic implementation used in our algorithm by \(\mathsf {SGN}\); the pseudocode is provided below.

[Pseudocode for the algorithm \(\mathsf {SGN}\); not reproduced here. An informal sketch follows.]
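As with \(\mathsf {SHATTER}\), the original pseudocode is not reproduced above. The following informal sketch of \(\mathsf {SGN}\) is assembled from the surrounding discussion and the iteration count of Theorem 4.9, with the stable inversion \(\mathsf {INV}\) of Definition 2.7 modeled by numpy's inversion (our reconstruction, not the paper's verbatim pseudocode):

```python
import numpy as np

def sgn(A, alpha0, eps0, beta):
    """Run N Newton steps A <- (A + A^{-1})/2, with N as in Theorem 4.9."""
    lg = np.log2
    N = int(np.ceil(lg(1 / (1 - alpha0)) + 3 * lg(lg(1 / (1 - alpha0)))
                    + lg(lg(1 / (beta * eps0))) + 7.59))
    for _ in range(N):
        A = 0.5 * (A + np.linalg.inv(A))   # stands in for the INV primitive
    return A
```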

In Sect. 4.1 we briefly discuss the specific preliminaries that will be used throughout this section. In Sect. 4.2 we give a pseudospectral proof of the rapid global convergence of this iteration when implemented in exact arithmetic. In Sect. 4.3 we show that the proof provided in Sect. 4.2 is robust enough to handle the finite arithmetic case; a formal statement of this main result is the content of Theorem 4.9.

4.1 Circles of Apollonius

It has been known since antiquity that a circle in the plane may be described as the set of points with a fixed ratio of distances to two focal points. By fixing the focal points and varying the ratio in question, we get a family of circles named for the Greek geometer Apollonius of Perga. We will exploit several interesting properties enjoyed by these Circles of Apollonius in the analysis below.

More precisely, we analyze the Newton iteration map \(g\) in terms of the family of Apollonian circles whose foci are the points \(\pm 1 \in \mathbb {C}\). For the remainder of this section we will write \(m(z) = \tfrac{1 - z}{1 + z}\) for the Möbius transformation taking the right half-plane to the unit disk, and for each \(\alpha \in (0,1)\) we denote by

$$\begin{aligned} \mathsf {C}^{+}_{\alpha } = \left\{ z \in \mathbb {C}: |m(z)| \le \alpha \right\} , \quad \mathsf {C}^{-}_{\alpha } = \{ z \in {\mathbb {C}} : |m(z)|^{-1} \le \alpha \} \end{aligned}$$

the closed region in the right (respectively left) half-plane bounded by such a circle. Write \(\partial \mathsf {C}^{+}_{\alpha }\) and \(\partial \mathsf {C}^{-}_{\alpha }\) for their boundaries, and \(\mathsf {C}_{\alpha } = \mathsf {C}^{+}_{\alpha } \cup \mathsf {C}^{-}_{\alpha }\) for their union. See Fig. 2 for an illustration.

Fig. 2

Apollonian circles appearing in the analysis of the Newton iteration. Depicted are \(\partial \mathsf {C}^{+}_{\alpha ^{2^{k}}}\) for \(\alpha =0.8\) and \(k = 0, 1, 2, 3\), with smaller circles corresponding to larger k

The region \(\mathsf {C}^{+}_{\alpha }\) is a disk centered at \(\tfrac{1 + \alpha ^2}{1 - \alpha ^2} \in \mathbb {R}\), with radius \(\tfrac{2\alpha }{1-\alpha ^2}\), and whose intersection with the real line is the interval \((m(\alpha ),m(\alpha )^{-1})\); \(\mathsf {C}^{-}_{\alpha }\) can be obtained by reflecting \(\mathsf {C}^{+}_{\alpha }\) with respect to the imaginary axis. For \(\alpha> \beta > 0\), we will write

$$\begin{aligned} \mathsf {A}^{+}_{\alpha ,\beta } = \mathsf {C}^{+}_{\alpha } \setminus \mathsf {C}^{+}_{\beta } \end{aligned}$$

for the Apollonian annulus lying inside \(\mathsf {C}^{+}_{\alpha }\) and outside \(\mathsf {C}^{+}_{\beta }\); note that the circles are not concentric so this is not strictly speaking an annulus, and note also that in our notation this set does not include \(\partial \mathsf {C}^{+}_{\beta }\). In the same way define \(\mathsf {A}^{-}_{\alpha ,\beta }\) for the left half-plane and write \(\mathsf {A}_{\alpha ,\beta } = \mathsf {A}^{+}_{\alpha ,\beta } \cup \mathsf {A}^{-}_{\alpha ,\beta }\).
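The description of \(\mathsf {C}^{+}_{\alpha }\) as a disk is easy to confirm numerically; the following check (illustrative, not from the paper) verifies that \(|m(\cdot )|\) is constant on the circle with the stated center and radius:

```python
import numpy as np

m = lambda z: (1 - z) / (1 + z)
alpha = 0.8
center = (1 + alpha**2) / (1 - alpha**2)
radius = 2 * alpha / (1 - alpha**2)
boundary = center + radius * np.exp(1j * np.linspace(0, 2 * np.pi, 8))
print(np.abs(m(boundary)))                          # constant, equal to alpha
print((1 - alpha) / (1 + alpha), center - radius)   # both equal m(alpha)
```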

Observation 4.1

([53]) The Newton map \(g\) is a two-to-one map from \(\mathsf {C}^{+}_{\alpha }\) to \(\mathsf {C}^{+}_{\alpha ^2}\), and a two-to-one map from \(\mathsf {C}^{-}_{\alpha }\) to \(\mathsf {C}^{-}_{\alpha ^2}\).

Proof

This follows from the fact that for each z in the right half-plane,

$$\begin{aligned} |m(g(z))| = \left| \frac{1 - \tfrac{1}{2}(z + 1/z)}{1 + \tfrac{1}{2}(z + 1/z)}\right| = \left| \frac{(1-z)^2}{(z + 1)^2}\right| = |m(z)|^2 \end{aligned}$$

and similarly for the left half-plane. \(\square \)

It follows from Observation 4.1 that under repeated application of the Newton map g, any point in the right or left half-plane converges to \(+1\) or \(-1\), respectively.
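The scalar dynamics are easy to observe directly: \(|m(z)|\) squares at every step, so a scalar in the right half-plane converges to \(+1\) quadratically (an illustrative sketch):

```python
import numpy as np

g = lambda z: 0.5 * (z + 1 / z)    # scalar Newton map
m = lambda z: (1 - z) / (1 + z)

z = 0.3 + 2.0j                     # right half-plane, so sgn(z) = +1
for k in range(6):
    print(k, abs(z - 1), abs(m(z)))   # |m(z)| squares at every step
    z = g(z)
```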

4.2 Exact Arithmetic

In this section, we set \(A_0 := A\) and \(A_{k+1} := g(A_k)\) for all \(k \ge 0\). In the case of exact arithmetic, Observation 4.1 implies global convergence of the Newton iteration when A is diagonalizable. For the convenience of the reader, we provide this argument (due to [53]) below.

Proposition 4.2

Let A be a diagonalizable \(n \times n\) matrix and assume that \(\Lambda (A) \subset \mathsf {C}_{\alpha }\) for some \(\alpha \in (0,1)\). Then for every \(N \in {\mathbb {N}}\) we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le \frac{4\alpha ^{2^N}}{1-\alpha ^{2^{N+1}}} \cdot \kappa _V(A). \end{aligned}$$

Moreover, when A does not have eigenvalues on the imaginary axis the minimum \(\alpha \) for which \(\Lambda (A) \subset \mathsf {C}_{\alpha }\) is given by

$$\begin{aligned} \alpha ^2 = \max _{1 \le i \le n} \left\{ 1 - \frac{4|{\text {Re}}(\lambda _i(A))|}{|\lambda _i(A)+\mathrm {sgn}(\lambda _i(A))|^2} \right\} . \end{aligned}$$

Proof

Consider the spectral decomposition \(A = \sum _{i=1}^n \lambda _i v_i w_i^*,\) and denote by \(\lambda _i^{(N)}\) the eigenvalues of \(A_N\).

By Observation 4.1, we have that \(\Lambda (A_N) \subset \mathsf {C}_{\alpha ^{2^N}}\) and \(\mathrm {sgn}(\lambda _i) = \mathrm {sgn}(\lambda _i^{(N)})\). Moreover, \(A_N\) and \(\mathrm {sgn}(A)\) have the same eigenvectors. Hence

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le \left\| \sum _{{\text {Re}}(\lambda _i) > 0} (\lambda _i^{(N)}-1) v_i w_i^* \right\| + \left\| \sum _{{\text {Re}}(\lambda _i) < 0} (\lambda _i^{(N)}+1) v_i w_i^* \right\| . \end{aligned}$$
(26)

Now we will use that for any matrix X we have \(\Vert X \Vert \le \kappa _V(X) \mathsf {spr}(X)\), where \(\mathsf {spr}(X)\) denotes the spectral radius of X. Observe that the spectral radii of the two matrices appearing on the right-hand side of (26) are bounded by \(\max _{i} |\lambda _i^{(N)}- \mathrm {sgn}(\lambda _i^{(N)})|\), which in turn is bounded by the radius of the circle \(\mathsf {C}^{+}_{\alpha ^{2^N}}\), namely \(2\alpha ^{2^N}/(1-\alpha ^{2^{N+1}})\). On the other hand, the eigenvector condition number of these matrices is bounded by \(\kappa _V(A)\). This concludes the first part of the statement.

In order to compute \(\alpha \) note that if \(z = x+ i y\) with \( x > 0\), then

$$\begin{aligned} |m(z)|^2 = \frac{(1-x)^2+ y^2}{(1+x)^2+ y^2} = 1- \frac{4x}{(1+x)^2+y^2}, \end{aligned}$$

and analogously when \(x < 0\) and we evaluate \(|m(z)|^{-2}\). \(\square \)

The above analysis breaks down when one tries to prove the same statement in the framework of finite arithmetic. This is due to the fact that at each step of the iteration the roundoff error can make the eigenvector condition numbers of the \(A_k\) grow. In fact, since \(\kappa _V(A_k)\) is sensitive to infinitesimal perturbations whenever \(A_k\) has a multiple eigenvalue, it seems difficult to control it against adversarial perturbations as the iteration converges to \(\mathrm {sgn}(A)\) (which has very high multiplicity eigenvalues). A different approach, also due to [53], yields a proof of convergence in exact arithmetic even when A is not diagonalizable. However, that proof relies heavily on the fact that \(m(A_N)\) is an exact power of \(m(A_0)\), or more precisely, it requires the matrices \(A_k\) to have the same generalized eigenvectors, which is again not the case in the finite arithmetic setting.

Therefore, a robust version, tolerant to perturbations, of the above proof is needed. To this end, instead of simultaneously keeping track of the eigenvector condition number and the spectrum of the matrices \(A_k\), we will just show that for certain \(\epsilon _k > 0\), the \(\epsilon _k\)-pseudospectra of these matrices are contained in a certain shrinking region dependent on k. This invariant is inherently robust to perturbations smaller than \(\epsilon _k\), unaffected by clustering of eigenvalues due to convergence, and allows us to bound the accuracy and other quantities of interest via the functional calculus. For example, the following lemma shows how to obtain a bound on \(\Vert A_N- \mathrm {sgn}(A)\Vert \) solely using information from the pseudospectrum of \(A_N\).

Lemma 4.3

(Pseudospectral Error Bound) Let A be any \(n \times n\) matrix and let \(A_N\) be the Nth iterate of the Newton iteration under exact arithmetic. Assume that \(\epsilon _N > 0\) and \(\alpha _N \in (0, 1)\) satisfy \(\Lambda _{\epsilon _N}(A_N) \subset \mathsf {C}_{\alpha _N}\). Then, we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A)\Vert \le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N}. \end{aligned}$$
(27)

Proof

Note that \(\mathrm {sgn}(A) = \mathrm {sgn}(A_N)\). Using the functional calculus, we get

$$\begin{aligned}&\Vert A_N - \mathrm {sgn}(A_N) \Vert = \left\| \frac{1}{2\pi i} \oint _{\partial \mathsf {C}_{\alpha _N}} z(z-A_N)^{-1}\,dz \right. \\&\qquad \left. - \frac{1}{2\pi i}\left( \oint _{\partial \mathsf {C}^{+}_{\alpha _N}} (z-A_N)^{-1}\,dz - \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} (z-A_N)^{-1}\,dz\right) \right\| \\&\quad = \left\| \frac{1}{2\pi i}\oint _{\partial \mathsf {C}^{+}_{\alpha _N}} z (z-A_N)^{-1} - (z-A_N)^{-1}\,dz \right. \\&\qquad \left. + \frac{1}{2\pi i} \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} z (z-A_N)^{-1} + (z-A_N)^{-1}\,dz \right\| \\&\quad \le \frac{1}{2\pi } \left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _N}} (z - 1) (z-A_N)^{-1} \,dz \right\| + \frac{1}{2\pi } \left\| \oint _{\partial \mathsf {C}^{-}_{\alpha _N}} (z + 1) (z-A_N)^{-1} \,dz\right\| \\&\quad \le 2 \cdot \frac{1}{2\pi } \ell (\partial \mathsf {C}^{+}_{\alpha _{N}} ) \sup \{|z - 1| : z \in \mathsf {C}^{+}_{\alpha _N}\} \frac{1}{\epsilon _N} \\&\quad = \frac{ 4 \alpha _N}{1 - \alpha _N^2} \left( \frac{1+\alpha _N}{1-\alpha _N} - 1\right) \frac{1}{\epsilon _N} \\&\quad = \frac{ 8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N}. \end{aligned}$$

\(\square \)

In view of Lemma 4.3, we would now like to find sequences \(\alpha _k\) and \(\epsilon _k\) such that

$$\begin{aligned} \Lambda _{\epsilon _k}(A_k)\subset \mathsf {C}_{\alpha _k} \end{aligned}$$

and \(\alpha _k^2/\epsilon _k\) converges rapidly to zero. The dependence of this quantity on the square of \(\alpha _k\) turns out to be crucial. As we will see below, we can find such a sequence with \(\epsilon _k\) shrinking roughly at the same rate as \(\alpha _k\). This yields quadratic convergence, which will be necessary for our bound on the required machine precision in the finite arithmetic analysis of Sect. 4.3.

The lemma below is instrumental in determining the sequences \(\alpha _k, \epsilon _k\).

Lemma 4.4

(Key Lemma) If \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), then for every \(\alpha '>\alpha ^2\), we have \(\Lambda _{\epsilon '}(g(A))\subset \mathsf {C}_{\alpha '}\) where

$$\begin{aligned} \epsilon ' := \epsilon \, \frac{(\alpha ' - \alpha ^2)(1-\alpha ^2)}{8\alpha }. \end{aligned}$$

Proof

From the definition of pseudospectrum, our hypothesis implies \(\Vert (z - A)^{-1}\Vert < 1/\epsilon \) for every z outside of \(\mathsf {C}_{\alpha }\). The proof will hinge on the observation that, for each \(\alpha ' \in (\alpha ^2,\alpha )\), this resolvent bound allows us to bound the resolvent of \(g(A)\) everywhere in the Apollonian annulus \(\mathsf {A}_{\alpha ,\alpha '}\).

Let \(w \in \mathsf {A}_{\alpha ,\alpha '}\); see Fig. 3 for an illustration. We must show that \(w \not \in \Lambda _{\epsilon '}(g(A))\). Since \(w \not \in \mathsf {C}_{\alpha ^2}\), Observation 4.1 ensures no \(z \in \mathsf {C}_{\alpha }\) satisfies \(g(z) = w\); in other words, the function \((w - g(z))^{-1}\) is holomorphic in z on \(\mathsf {C}_{\alpha }\). As \(\Lambda (A) \subset \Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), Observation 4.1 also guarantees that \(\Lambda (g(A)) \subset \mathsf {C}_{\alpha ^2}\). Thus for w in the union of the two Apollonian annuli in question, we can calculate the resolvent of \(g(A)\) at w using the holomorphic functional calculus:

$$\begin{aligned} (w - g(A))^{-1} = \frac{1}{2\pi i}\oint _{\partial \mathsf {C}_{\alpha }} (w - g(z))^{-1}(z - A)^{-1}d z, \end{aligned}$$

where by this we mean to sum the integrals over \(\partial \mathsf {C}^{+}_{\alpha }\) and \(\partial \mathsf {C}^{-}_{\alpha }\), both positively oriented. Taking norms, passing inside the integral, and applying Observation 4.1 one final time, we get:

$$\begin{aligned} \left\| (w - g(A))^{-1} \right\|&\le \frac{1}{2\pi }\oint _{\partial \mathsf {C}_{\alpha }}|(w - g(z))^{-1}|\cdot \Vert (z - A)^{-1}\Vert d z\\&\le \frac{\ell \left( \partial \mathsf {C}^{+}_{\alpha }\right) \sup _{y \in \mathsf {C}^{+}_{\alpha ^2}}|(w - y)^{-1}| + \ell \left( \partial \mathsf {C}^{-}_{\alpha }\right) \sup _{y \in \mathsf {C}^{-}_{\alpha ^2}}|(w - y)^{-1}|}{2\pi \epsilon } \\&\le \frac{1}{\epsilon } \frac{8\alpha }{(\alpha ' - \alpha ^2)(1 - \alpha ^2)}. \end{aligned}$$

In the last step we also use the forthcoming Lemma 4.5. Thus, with \(\epsilon '\) defined as in the theorem statement, \(\mathsf {A}_{\alpha ,\alpha '}\) contains none of the \(\epsilon '\)-pseudospectrum of \(g(A)\). Since \(\Lambda (g(A)) \subset \mathsf {C}_{\alpha ^2}\), Proposition 2.3 tells us that there can be no \(\epsilon '\)-pseudospectrum in the remainder of \(\mathbb {C}\setminus \mathsf {C}_{\alpha '}\), as such a connected component would need to contain an eigenvalue of \(g(A)\). \(\square \)

Fig. 3

Illustration of the proof of Lemma 4.4

Lemma 4.5

Let \(1> \alpha , \beta > 0\) be given. Then for any \(x \in \partial \mathsf {C}_{\alpha }\) and \(y \in \partial \mathsf {C}_{\beta }\), we have \(|x-y| \ge |\alpha -\beta |/2\).

Proof

Without loss of generality \(x \in \partial \mathsf {C}^{+}_{\alpha }\) and \(y \in \partial \mathsf {C}^{+}_{\beta }\). Then, we have

$$\begin{aligned} |\alpha - \beta | = \left| |m(x)|- |m(y)|\right| \le |m(x)- m(y)| = \frac{2|x-y|}{{|1+x||1+y|}} \le 2|x-y|. \end{aligned}$$

\(\square \)

Lemma 4.4 will also be useful in bounding the condition numbers of the \(A_k\), which is necessary for the finite arithmetic analysis.

Corollary 4.6

(Condition Number Bound) Using the notation of Lemma 4.4, if \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\), then

$$\begin{aligned} \Vert A^{-1}\Vert&\le \frac{1}{\epsilon } \quad \text {and} \quad \Vert A\Vert \le \frac{4\alpha }{(1 - \alpha )^2 \epsilon }. \end{aligned}$$

Proof

The bound \(\Vert A^{-1} \Vert \le 1/\epsilon \) follows from the fact that \(0 \notin \mathsf {C}_{\alpha } \supset \Lambda _{\epsilon }(A).\) In order to bound \(\Vert A\Vert \) we use the contour integral bound

$$\begin{aligned} \Vert A \Vert&= \left\| \frac{1}{2\pi i} \oint _{\partial \mathsf {C}_{\alpha }} z (z - A)^{-1}\,dz \right\| \\&\le \frac{\ell (\partial \mathsf {C}_{\alpha })}{2\pi } \left( \sup _{z \in \partial \mathsf {C}_{\alpha }} |z| \right) \frac{1}{\epsilon } \\&= \frac{4 \alpha }{1 - \alpha ^2} \frac{1+\alpha }{1-\alpha } \frac{1}{\epsilon }. \\ \end{aligned}$$

\(\square \)

Another direct application of Lemma 4.4 yields the following.

Lemma 4.7

Let \(\epsilon > 0\). If \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\) and \( 1/\alpha> D > 1\), then for every N we have the guarantee

$$\begin{aligned} \Lambda _{\epsilon _N}(A_N) \subset \mathsf {C}_{\alpha _N}, \end{aligned}$$

for \(\alpha _N =(D\alpha )^{2^N}/D\) and \(\epsilon _N = \frac{\alpha _N \epsilon }{\alpha } \left( \frac{(D-1)(1-\alpha ^2)}{8D}\right) ^N \).

Proof

Define recursively \(\alpha _0 = \alpha \), \(\epsilon _0 = \epsilon \), \(\alpha _{k+1} = D \alpha _k^2\) and \(\epsilon _{k+1}= \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _0^2).\) It is easy to see by induction that this definition is consistent with the definition of \(\alpha _N\) and \(\epsilon _N\) given in the statement.

We will now show by induction that \(\Lambda _{\epsilon _k}(A_k) \subset \mathsf {C}_{\alpha _k}\). Assume the statement is true for k, so from Lemma 4.4 we have that the statement is also true for \(A_{k+1}\) if we pick the pseudospectral parameter to be

$$\begin{aligned} \epsilon ' = \epsilon _k \frac{(\alpha _{k+1}-\alpha _k^2)(1-\alpha _k^2)}{8\alpha _k} = \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _k^2). \end{aligned}$$

On the other hand,

$$\begin{aligned} \frac{1}{8} \epsilon _k \alpha _k (D-1)(1-\alpha _k^2) \ge \frac{1}{8} \epsilon _k \alpha _k (D-1) (1-\alpha _0^2) = \epsilon _{k+1}, \end{aligned}$$

which concludes the proof of the statement. \(\square \)
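The behavior of the two sequences is worth seeing side by side: \(\alpha _k\) decays doubly exponentially while \(\epsilon _k\) loses only a geometric factor per step. The sketch below (with arbitrary sample parameters, ours not the paper's) simply evaluates the recurrences from the proof:

```python
def sequences(alpha, eps, D, N):
    """Evaluate alpha_{k+1} = D alpha_k^2, eps_{k+1} = eps_k alpha_k (D-1)(1-alpha_0^2)/8."""
    a, e = alpha, eps
    out = [(a, e)]
    for _ in range(N):
        e = e * a * (D - 1) * (1 - alpha ** 2) / 8   # uses the fixed 1 - alpha_0^2
        a = D * a ** 2
        out.append((a, e))
    return out

for a, e in sequences(alpha=0.9, eps=1e-3, D=1.05, N=5):
    print(f"alpha_k = {a:.3e}, eps_k = {e:.3e}")
```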

We are now ready to prove the main result of this section, a pseudospectral version of Proposition 4.2.

Proposition 4.8

Let \(A\in {\mathbb {C}}^{n\times n}\) be a diagonalizable matrix and assume that \(\Lambda _\epsilon (A) \subset \mathsf {C}_{\alpha }\) for some \(\alpha \in (0,1)\). Then, for any \(1< D < \frac{1}{\alpha }\) and every N, we have the guarantee

$$\begin{aligned} \Vert A_N - \mathrm {sgn}(A) \Vert \le (D\alpha )^{2^N}\cdot \frac{\alpha (1-\alpha ^2)^2}{8\epsilon }\cdot \left( \frac{8D}{(D-1)(1-\alpha ^2)}\right) ^{N+2}. \end{aligned}$$

Proof

Using the choice of \(\alpha _k\) and \(\epsilon _k\) given in the proof of Lemma 4.7 and the bound (27), we get that

$$\begin{aligned} \Vert A_N- \mathrm {sgn}(A)\Vert&\le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N} \\&= \frac{8 \alpha _0 \alpha _N }{\epsilon _0 (1- \alpha _N)^2 (1+\alpha _N) } \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\&= (D \alpha _0)^{2^N} \frac{8 D^2 \alpha _0}{(D-(D\alpha _0)^{2^{N}})^2(D+(D\alpha _0)^{2^N}) \epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\ {}&\le (D \alpha _0)^{2^N} \frac{8D^2 \alpha _0}{(D-1)^2 \epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^N \\&= (D\alpha _0)^{2^N}\, \frac{\alpha _0 (1-\alpha _0^2)^2}{8\epsilon _0} \left( \frac{8D}{(D-1)(1-\alpha _0^2)}\right) ^{N+2}, \end{aligned}$$

where the last inequality was taken solely to make the expression more intuitive, since not much is lost by doing so. \(\square \)

4.3 Finite Arithmetic

Finally, we turn to the analysis of \(\mathsf {SGN}\) in finite arithmetic. By making the machine precision small enough, we can bound the effect of roundoff to ensure that the parameters \(\alpha _k\), \(\epsilon _k\) are not too far from what they would have been in the exact arithmetic analysis above. We will stop the iteration before any of the quantities involved become prohibitively small, so we will only need \(\mathrm {polylog}(1/(1-\alpha _0), 1/\epsilon _0, 1/\beta )\) bits of precision, where \(\beta \) is the accuracy parameter.

In exact arithmetic, recall that the Newton iteration is given by \(A_{k+1} = g(A_{k}) = \frac{1}{2} (A_k + A_k^{-1}).\) Here we will consider the finite arithmetic version \(\mathsf {G}\) of the Newton map \(g\), defined as \(\mathsf {G}(A) := g(A)+E_A\) where \(E_A\) is an adversarial perturbation coming from the roundoff error. Hence, the sequence of interest is given by \({\widetilde{A}}_0 := A\) and \({\widetilde{A}}_{k+1} := \mathsf {G}({\widetilde{A}}_k)\).

In this subsection we will prove the following theorem concerning the runtime and precision of \(\mathsf {SGN}\). Our assumptions on the size of the parameters \(\alpha _0, \beta , \mu _{\mathsf {INV}}(n)\) and \(c_\mathsf {INV}\) are in place only to simplify the analysis of constants; these assumptions are not required for the execution of the algorithm.

Theorem 4.9

(Main guarantees for \(\mathsf {SGN}\)) Assume \(\mathsf {INV}\) is a \((\mu _{\mathsf {INV}}(n), c_\mathsf {INV})\)-stable matrix inversion algorithm satisfying Definition 2.7. Let \(\epsilon _0\in (0,1), \beta \in (0,1/12)\), assume \(\mu _{\mathsf {INV}}(n) \ge 1\) and \(c_\mathsf {INV}\log n \ge 1\), and assume \(A = {\widetilde{A}}_0\) is a floating-point matrix with \(\epsilon _0\)-pseudospectrum contained in \(\mathsf {C}_{\alpha _0}\) where \(0< 1 - \alpha _0 < 1/100\). Run \(\mathsf {SGN}\) with

$$\begin{aligned} N = \lceil \lg (1/(1- \alpha _0)) + 3 \lg \lg (1/(1-\alpha _0)) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil \end{aligned}$$

iterations (as specified in the statement of the algorithm). Then \(\widetilde{A_N}=\mathsf {SGN}(A)\) satisfies the advertised accuracy guarantee

$$\begin{aligned} \Vert \widetilde{A_N} - \mathrm {sgn}(A) \Vert \le \beta \end{aligned}$$

when run with machine precision satisfying

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_{\mathsf {SGN}} := \frac{ \alpha _0^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}, \end{aligned}$$

corresponding to at most

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) = O( \log n \log ^3(1/(1-\alpha _0)) (\log (1/\beta )+ \log (1/\epsilon _0)))\end{aligned}$$

required bits of precision. The number of arithmetic operations is at most

$$\begin{aligned} N(4 n^2 + T_\mathsf {INV}(n)) . \end{aligned}$$
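To get a feel for these bounds, the following sketch evaluates N and \(\lg (1/{\textbf {u }}_{\mathsf {SGN}})\) for sample parameters; the values plugged in for \(c_\mathsf {INV}\) and \(\mu _{\mathsf {INV}}(n)\) are placeholders in the spirit of Theorem 2.10 (and \(\lg \) is used for the unspecified base of \(c_\mathsf {INV}\log n\)), not prescribed by the paper:

```python
import numpy as np

def sgn_parameters(n, alpha0, eps0, beta, c_inv=8, mu_inv=None):
    lg = np.log2
    mu_inv = float(n) if mu_inv is None else mu_inv   # crude placeholder choice
    N = int(np.ceil(lg(1 / (1 - alpha0)) + 3 * lg(lg(1 / (1 - alpha0)))
                    + lg(lg(1 / (beta * eps0))) + 7.59))
    # lg(1/u_SGN) for u_SGN = alpha0^(2^(N+1)(c_inv lg n + 3)) / (mu_inv sqrt(n) N)
    lg_u = (2.0 ** (N + 1)) * (c_inv * lg(n) + 3) * lg(1 / alpha0) \
           + lg(mu_inv) + 0.5 * lg(n) + lg(N)
    return N, lg_u   # iteration count and bits of precision

print(sgn_parameters(n=1000, alpha0=1 - 1e-3, eps0=1e-6, beta=1e-3))
```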

Later on, we will need to call \(\mathsf {SGN}\) on a matrix with shattered pseudospectrum; the lemma below calculates acceptable parameter settings for shattering so that the pseudospectrum is contained in the required pair of Apollonian circles, satisfying the hypothesis of Theorem 4.9.

Lemma 4.10

If A has \(\epsilon \)-pseudospectrum shattered with respect to a grid \(\mathsf {g}= \mathsf {grid}(z_0,\omega ,s_1,s_2)\) that includes the imaginary axis as a grid line, then one has \(\Lambda _{\epsilon _0}(A) \subseteq \mathsf {C}_{\alpha _0}\) where \(\epsilon _0 = \epsilon /2\) and

$$\begin{aligned} \alpha _0 = 1 - \frac{\epsilon }{{{\,\mathrm{diam}\,}}(\mathsf {g})^2}. \end{aligned}$$

In particular, if \(\epsilon \) is at least \(1/\mathrm {poly}(n)\) and \(\omega s_1\) and \(\omega s_2\) are at most \(\mathrm {poly}(n)\), then \(\epsilon _0\) and \(1-\alpha _0\) are also at least \(1/\mathrm {poly}(n)\).

Proof

First, because it is shattered, the \(\epsilon /2\)-pseudospectrum of A is at least distance \(\epsilon /2\) from \(\mathsf {g}\). Recycling the calculation from Proposition 4.2, it suffices to take

$$\begin{aligned} \alpha _0^2 = \max _{z \in \Lambda _{\epsilon /2}(A)}\left( 1 - \frac{4|{\text {Re}}z|}{|z + \mathrm {sgn}(z)|^2}\right) . \end{aligned}$$

From what we just observed about the pseudospectrum, we can take \(|{\text {Re}}z| \ge \epsilon /2\). To bound the denominator, we can use the crude bound that any two points inside the grid are at distance no more than \({{\,\mathrm{diam}\,}}(\mathsf {g})\). Finally, we use \(\sqrt{1 - x} \le 1 - x/2\) for any \(x\in (0,1)\). \(\square \)

The proof of Theorem 4.9 will proceed as in the exact arithmetic case, with the modification that \(\epsilon _k\) must be decreased by an additional factor after each iteration to account for roundoff. At each step, we set the machine precision \({\textbf {u }}\) small enough so that the \(\epsilon _k\) remain close to what they would be in exact arithmetic. For the analysis we will introduce an explicit auxiliary sequence \(e_k\) that lower bounds the \(\epsilon _k\), provided that \({\textbf {u }}\) is small enough.

Lemma 4.11

(One-step additive error) Assume the matrix inverse is computed by an algorithm \(\mathsf {INV}\) satisfying the guarantee in Definition 2.7. Then \(\mathsf {G}(A) = g(A) + E\) for some error matrix E with norm

$$\begin{aligned} \Vert E \Vert \le \left( \Vert A \Vert + \Vert A^{-1} \Vert + \mu _{\mathsf {INV}}(n) \kappa (A)^{c_\mathsf {INV}\log n}\Vert A^{-1}\Vert \right) 2 \sqrt{n} {\textbf {u }}.\end{aligned}$$
(28)

The proof of this lemma is deferred to “Appendix A”.

With the error bound for each step in hand, we now move to the analysis of the whole iteration. It will be convenient to define \(s := 1 - \alpha _0\), which should be thought of as a small parameter. As in the exact arithmetic case, for \(k \ge 1,\) we will recursively define decreasing sequences \(\alpha _k\) and \(\epsilon _k\) maintaining the property

$$\begin{aligned} \Lambda _{\epsilon _k}({\widetilde{A}}_k) \subset \mathsf {C}_{\alpha _k} \qquad \text { for all }k \ge 0 \end{aligned}$$
(29)

by induction as follows:

  1.

    The base case \(k=0\) holds because by assumption, \(\Lambda _{\epsilon _0}({\widetilde{A}}_0) \subset \mathsf {C}_{\alpha _0}\).

  2.

    Here we recursively define \(\alpha _{k+1}\). Set

    $$\begin{aligned} \alpha _{k+1} := (1 + s/4) \alpha _k^2. \end{aligned}$$

    In the notation of Sect. 4.2, this corresponds to setting \(D = 1+s/4\). This definition ensures that \(\alpha _k^2 \le \alpha _{k+1} \le \alpha _k\) for all k, and also gives us the bound \((1+s/4)\alpha _0 \le 1-s/2\). We also have the closed form

    $$\begin{aligned} \alpha _k = (1+s/4)^{2^k - 1} \alpha _0^{2^k}, \end{aligned}$$

    which implies the useful bound

    $$\begin{aligned} \alpha _k \le (1-s/2)^{2^k}. \end{aligned}$$
    (30)
  3.

    Here we recursively define \(\epsilon _{k+1}\). Combining Lemma 4.4, the recursive definition of \(\alpha _{k+1}\), and the fact that \(1 - \alpha _k^2 \ge 1 - \alpha _0^2 \ge 1 - \alpha _0 = s\), we find that \(\Lambda _{\epsilon '}\left( g({\widetilde{A}}_k)\right) \subset \mathsf {C}_{\alpha _{k+1}}\), where

    $$\begin{aligned} \epsilon ' = \epsilon _k \frac{\left( \alpha _{k+1} - \alpha _k^2\right) (1-\alpha _k^2)}{8\alpha _k} = \epsilon _k \frac{s\alpha _k(1-\alpha _k^2)}{32} \ge \epsilon _k\frac{ \alpha _k s^2}{32}. \end{aligned}$$

    Thus in particular

    $$\begin{aligned} \Lambda _{\epsilon _k\alpha _k s^2/32} \left( g({\widetilde{A}}_k)\right) \subset \mathsf {C}_{\alpha _{k+1}}. \end{aligned}$$

    Since \({\widetilde{A}}_{k+1} = \mathsf {G}({\widetilde{A}}_k) = g({\widetilde{A}}_k) + E_k\), for some error matrix \(E_k\) arising from roundoff, Proposition 2.2 ensures that if we set

    $$\begin{aligned} \epsilon _{k+1} := \epsilon _k\frac{ s^2 \alpha _k }{32} - \Vert E_k \Vert \end{aligned}$$
    (31)

    we will have \(\Lambda _{\epsilon _{k+1}}({\widetilde{A}}_{k+1}) \subset \mathsf {C}_{\alpha _{k+1}}, \) as desired.

We now need to show that the \(\epsilon _{k}\) do not decrease too fast as k increases. In view of (31), it will be helpful to set the machine precision small enough to guarantee that \(\Vert E_k \Vert \) is a small fraction of \(\epsilon _k\frac{ \alpha _k s^2}{32}\).

First, we need to control the quantities \(\Vert {\widetilde{A}}_k\Vert \), \(\Vert {\widetilde{A}}_k^{-1}\Vert \), and \(\kappa ({\widetilde{A}}_k) =\Vert {\widetilde{A}}_k\Vert \Vert {\widetilde{A}}_k^{-1}\Vert \) appearing in our upper bound (28) on \(\Vert E_k \Vert \) from Lemma 4.11, as functions of \(\epsilon _k\). By Corollary 4.6, we have

$$\begin{aligned} \Vert {\widetilde{A}}_k^{-1}\Vert&\le \frac{1}{\epsilon _k} \quad \text {and} \quad \Vert {\widetilde{A}}_k\Vert \le 4\frac{\alpha _k}{(1 - \alpha _k)^2\epsilon _k} \le \frac{4}{s^2 \epsilon _k}. \end{aligned}$$

Thus, we may write the coefficient of \({\textbf {u }}\) in the bound (28) as

$$\begin{aligned} K_{\epsilon _k} := \left[ \frac{4}{s^2\epsilon _k} + \frac{1}{\epsilon _k} + \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2\epsilon _k^2} \right) ^{c_\mathsf {INV}\log n}\frac{1}{\epsilon _k} \right] 2 \sqrt{n} \end{aligned}$$

so that Lemma 4.11 reads

$$\begin{aligned} \Vert {E_k}\Vert \le K_{\epsilon _k} {\textbf {u }}. \end{aligned}$$
(32)

Plugging this into the definition (31) of \(\epsilon _{k+1}\), we have

$$\begin{aligned} \epsilon _{k+1} \ge \epsilon _k \frac{s^2\alpha _k}{32} - K_{\epsilon _k} {\textbf {u }}. \end{aligned}$$
(33)

Now suppose we take \({\textbf {u }}\) small enough so that

$$\begin{aligned} K_{\epsilon _k} {\textbf {u }}\le \frac{1}{3} \epsilon _k \frac{s^2\alpha _k}{32}. \end{aligned}$$
(34)

For such \({\textbf {u }}\), we then have

$$\begin{aligned} \epsilon _{k+1} \ge \frac{2}{3}\epsilon _k \frac{s^2\alpha _k}{32} = \frac{1}{48} \epsilon _k s^2 \alpha _k, \end{aligned}$$
(35)

which implies

$$\begin{aligned} \Vert E_k \Vert \le \frac{1}{2} \epsilon _{k+1}; \end{aligned}$$
(36)

this bound is loose but sufficient for our purposes. Inductively, we now have the following bound on \(\epsilon _k\) in terms of \(\alpha _k\):

Lemma 4.12

(Preliminary lower bound on \(\epsilon _k\)) Let \(k \ge 0\), and for all \(0 \le i \le k-1\), assume \({\textbf {u }}\) satisfies the requirement (34):

$$\begin{aligned} K_{\epsilon _i}{\textbf {u }}\le \frac{1}{3} \epsilon _i \frac{s^2 \alpha _i}{32}. \end{aligned}$$

Then, we have

$$\begin{aligned} \epsilon _k \ge e_k := \epsilon _0 \left( \frac{s^2}{50}\right) ^k \alpha _k. \end{aligned}$$

In fact, it suffices to assume the hypothesis only for \(i=k-1\).

Proof

The last statement follows from the fact that \(\epsilon _i\) is decreasing in i and \(K_{\epsilon _i}\) is increasing in i.

Since (34) implies (35), we may apply (35) repeatedly to obtain

$$\begin{aligned} \epsilon _k&\ge \epsilon _0 (s^2/48)^k \prod _{i=0}^{k-1} \alpha _i&\\&= \epsilon _0 (s^2/48)^k (1+s/4)^{2^k - 1 - k}\alpha _0^{2^k-1}&{\text { by the definition of }\alpha _i}\\&= \epsilon _0 \left( \frac{s^2}{48(1+s/4)}\right) ^k \frac{\alpha _k}{\alpha _0}&\\&\ge \epsilon _0 \left( \frac{s^2}{50}\right) ^k \alpha _k.&\alpha _0 \le 1, s < 1/8 \end{aligned}$$

\(\square \)

We now show that the conclusion of Lemma 4.12 still holds if we replace \(\epsilon _i\) everywhere in the hypothesis by \(e_i\), which is an explicit function of \(\epsilon _0\) and \(\alpha _0\) defined in Lemma 4.12. Note that we do not know \(\epsilon _i \ge e_i\) a priori, so to avoid circularity we must use a short inductive argument.

Corollary 4.13

(Lower bound on \(\epsilon _k\) with explicit hypothesis) Let \(k \ge 0\), and for all \(0 \le i \le k-1\), assume \({\textbf {u }}\) satisfies

$$\begin{aligned} K_{e_i} {\textbf {u }}\le \frac{1}{3} e_i \frac{s^2 \alpha _i}{32} \end{aligned}$$
(37)

where \(e_i\) is defined in Lemma 4.12. Then, we have

$$\begin{aligned} \epsilon _k \ge e_k.\end{aligned}$$

In fact, it suffices to assume the hypothesis only for \(i=k-1\).

Proof

The last statement follows from the fact that \(e_i\) is decreasing in i and \(K_{e_i}\) is increasing in i.

Assuming the full hypothesis of this lemma, we prove \(\epsilon _i \ge e_i\) for \(0 \le i \le k\) by induction on i. For the base case, we have \(\epsilon _0 \ge e_0 = \epsilon _0 \alpha _0\).

For the inductive step, assume \(\epsilon _i \ge e_i\). Then as long as \(i \le k-1\), the hypothesis of this lemma implies

$$\begin{aligned} K_{\epsilon _i} {\textbf {u }}\le \frac{1}{3} \epsilon _i \frac{s^2 \alpha _i}{32}, \ \end{aligned}$$

so we may apply Lemma 4.12 to obtain \(\epsilon _{i+1} \ge e_{i+1}\), as desired. \(\square \)

Lemma 4.14

(Main accuracy bound) Suppose \({\textbf {u }}\) satisfies the requirement (34) for all \(0 \le k \le N\). Then

$$\begin{aligned} \Vert {\widetilde{A}}_N - \mathrm {sgn}(A) \Vert \le \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2} + \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N}. \end{aligned}$$
(38)

Proof

Since \(\mathrm {sgn}= \mathrm {sgn}\circ g\), for every k we have

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_k})\Vert&= \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(g(\widetilde{A_k})) \Vert = \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) \\&\quad - \mathrm {sgn}(\widetilde{A_{k+1}} - E_k) \Vert . \end{aligned}$$

From the holomorphic functional calculus we can rewrite \(\Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_{k+1}} - E_k) \Vert \) as the norm of a certain contour integral, which in turn can be bounded as follows:

$$\begin{aligned}&\frac{1}{2\pi }\left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} [(z-\widetilde{A_{k+1}})^{-1} - (z - (\widetilde{A_{k+1}} - E_k))^{-1} ]\, dz \right. \\&\qquad \left. -\oint _{\partial \mathsf {C}^{-}_{\alpha _{k+1}}} [(z-\widetilde{A_{k+1}})^{-1} - (z - (\widetilde{A_{k+1}} - E_k))^{-1} ]\, dz \right\| \\&\quad = \frac{1}{2\pi }\left\| \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} [(z - (\widetilde{A_{k+1}} - E_k))^{-1}E_k(z-\widetilde{A_{k+1}})^{-1} ]\, dz \right. \\&\qquad \left. - \oint _{\partial \mathsf {C}^{-}_{\alpha _{k+1}}} [(z - (\widetilde{A_{k+1}} - E_k))^{-1}E_k(z-\widetilde{A_{k+1}})^{-1} ]\, dz \right\| \\&\quad \le \frac{1}{\pi } \oint _{\partial \mathsf {C}^{+}_{\alpha _{k+1}}} \Vert (z - (\widetilde{A_{k+1}} - E_k))^{-1}\Vert \Vert E_k \Vert \Vert (z - \widetilde{A_{k+1}})^{-1} \Vert \,dz\\&\quad \le \frac{1}{\pi }\ell (\partial \mathsf {C}^{+}_{\alpha _{k+1}}) \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}} \\&\quad = \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}}, \end{aligned}$$

where we use the definition (6) of pseudospectrum and Proposition 2.2, together with the property (29). Ultimately, this chain of inequalities implies

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_{k+1}}) - \mathrm {sgn}(\widetilde{A_{k}}) \Vert \le \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}}. \end{aligned}$$

Summing over all k and using the triangle inequality, we obtain

$$\begin{aligned} \Vert \mathrm {sgn}(\widetilde{A_N}) - \mathrm {sgn}(\widetilde{A_0}) \Vert&\le \sum _{k=0}^{N-1} \frac{4\alpha _{k+1}}{1 - \alpha _{k+1}^2} \Vert E_k \Vert \frac{1}{\epsilon _{k+1} - \Vert E_k \Vert }\frac{1}{\epsilon _{k+1}} \\&\le \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2}, \end{aligned}$$

where in the last step we use \(\alpha _k \le 1\) and \(1 - \alpha _{k+1}^2 \ge s\), as well as (36).

By Lemma 4.3 (to be precise, by repeating the proof of that lemma with \(\widetilde{A_N}\) substituted for \(A_N\)), we have

$$\begin{aligned} \Vert \widetilde{A_N} - \mathrm {sgn}(\widetilde{A_N})\Vert&\le \frac{8 \alpha _N^2}{(1- \alpha _N)^2 (1+\alpha _N) \epsilon _N} \\&\le \frac{8 }{s^2} \alpha _N \frac{\alpha _N}{\epsilon _N} \\&\le \frac{8}{s^2} \alpha _N\frac{1}{\epsilon _0}\left( \frac{50}{s^2}\right) ^N \\&\le \frac{8}{s^2\epsilon _0} (1-s/2)^{2^N}\left( \frac{50}{s^2}\right) ^N \\&\le \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N}. \end{aligned}$$

where we use \(s < 1/2\) in the last step.

Combining the above with the triangle inequality, we obtain the desired bound.

\(\square \)

We would like to apply Lemma 4.14 to ensure \(\Vert \widetilde{A_N} - \mathrm {sgn}(A) \Vert \) is at most \(\beta \), the desired accuracy parameter. The upper bound (38) in Lemma 4.14 is the sum of two terms; we will make each term less than \(\beta /2\). The bound for the second term will yield a sufficient condition on the number of iterations N. Given that, the bound on the first term will then give a sufficient condition on the machine precision \({\textbf {u }}\). This will be the content of Lemmas 4.16 and 4.17.

We start with the second term. The following preliminary lemma will be useful:

Lemma 4.15

Let \(1/800> t > 0\) and \(1/2> c > 0\) be given. Then for

$$\begin{aligned} j \ge \lg (1/t) + 2 \lg \lg (1/t) + \lg \lg (1/c) + 1.62, \end{aligned}$$

we have

$$\begin{aligned} \frac{(1-t)^{2^j}}{t^{2j}} < c. \end{aligned}$$

The proof is deferred to “Appendix A”.

Lemma 4.16

(Bound on second term of (38)) Suppose we have

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62. \end{aligned}$$

Then

$$\begin{aligned} \frac{8 \cdot 50^N }{s^{2N+2}\epsilon _0}(1 - s/2)^{2^N} \le \beta /2. \end{aligned}$$

Proof

It is sufficient that

$$\begin{aligned} \frac{8 \cdot 64^N }{s^{2N+2}\epsilon _0}(1 - s/8)^{2^N} \le \beta /2. \end{aligned}$$

The result now follows from applying Lemma 4.15 with \(c = \beta s^2 \epsilon _0/16\) and \(t=s/8\). \(\square \)

Now we move to the first term in the bound of Lemma 4.14.

Lemma 4.17

(Bound on first term of (38))

Suppose

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62, \end{aligned}$$

and suppose the machine precision \({\textbf {u }}\) satisfies

$$\begin{aligned} {\textbf {u }}\le \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N} . \end{aligned}$$

Then we have

$$\begin{aligned} \frac{8}{s} \sum _{k=0}^{N-1} \frac{\Vert E_k \Vert }{\epsilon _{k+1}^2} \le \beta /2. \end{aligned}$$

Proof

It suffices to show that for all \(0 \le k \le N-1\),

$$\begin{aligned} \Vert E_k \Vert \le \frac{ \beta \epsilon _{k+1}^2 s}{16N}. \end{aligned}$$

In view of (32), which says \(\Vert {E_k}\Vert \le K_{\epsilon _k} {\textbf {u }}\), it is sufficient to have for all \(0 \le k \le N-1\)

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{\epsilon _k}}\frac{ \beta \epsilon _{k+1}^2 s}{16N}. \end{aligned}$$
(39)

For this, we claim it is sufficient to have for all \(0 \le k \le N-1\)

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{e_k}}\frac{ \beta e_{k+1}^2 s}{16N}. \end{aligned}$$
(40)

Indeed, on the one hand, since \(\beta < 1/6\) and by the loose bound \(e_{k+1}< s \alpha _{k+1} < s \alpha _k\), we have that (40) implies \({\textbf {u }}\le \frac{1}{3K_{e_k}} \frac{ s^2 \alpha _k e_k}{32}\), which means that the assumption in Corollary 4.13 is satisfied. On the other hand, Corollary 4.13 yields \(e_k\le \epsilon _k\) for all \(0\le k \le N\), which in turn, combined with (40), gives (39) and concludes the proof.

We now show that (40) holds for all \(0\le k\le N-1\). Because \(1/K_{e_k}\) and \(e_k\) are decreasing in k, it is sufficient to have the single condition

$$\begin{aligned} {\textbf {u }}\le \frac{1}{K_{e_N}}\frac{ \beta e_{N}^2 s}{16N}. \end{aligned}$$

We continue the chain of sufficient conditions on \({\textbf {u }}\), where each line implies the line above:

$$\begin{aligned} {\textbf {u }}&\le \frac{1}{K_{e_N}}\frac{ \beta e_N^2 s}{16N} \\ {\textbf {u }}&\le \frac{1}{\left[ \frac{4}{s^2e_N} + \frac{1}{e_N} + \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2e_N^2} \right) ^{c_\mathsf {INV}\log n}\frac{1}{e_N} \right] 2 \sqrt{n}} \frac{ \beta e_N^2 s}{16N} \\ {\textbf {u }}&\le \frac{1}{6 \mu _{\mathsf {INV}}(n) \left( \frac{4}{s^2 e_N} \right) ^{c_\mathsf {INV}\log n + 1} 2\sqrt{n}}\frac{ \beta e_N^2 s}{16 N} \\ {\textbf {u }}&\le \frac{ \beta }{6 \cdot 2 \cdot 16 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{e_N s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

where we use the bound \(\frac{1}{e_N} \le \frac{4}{s^2 e_N^2}\) without much loss, and we also use our assumption \(\mu _{\mathsf {INV}}(n) \ge 1\) and \(c_\mathsf {INV}\log n \ge 1\) for simplicity.

Substituting the value of \(e_N\) as defined in Lemma 4.12, we get the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{\epsilon _0 (s^2/50)^N \alpha _N s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

Replacing \(\alpha _N\) by the smaller quantity \(\alpha _0^{2^N} = (1-s)^{2^N}\) and cleaning up the constants yields the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{\epsilon _0 (s^2/50)^N (1-s)^{2^N} s^2}{4} \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

Now we finally will use our hypothesis on the size of N to simplify this expression. Applying Lemma 4.16, we have

$$\begin{aligned} \epsilon _0 (s^2/50)^N /4 \ge \frac{4 (1-s)^{2^N}}{s^2 \beta }. \end{aligned}$$

Thus, our sufficient condition becomes

$$\begin{aligned} {\textbf {u }}\le \frac{ \beta }{192 \mu _{\mathsf {INV}}(n) \sqrt{n} N} \left( \frac{4(1-s)^{2^{N+1}}}{\beta } \right) ^{c_\mathsf {INV}\log n + 3}. \end{aligned}$$

To make the expression simpler, since \(c_\mathsf {INV}\log n + 3 \ge 4\) we may pull out a factor of \(4^4 > 192\) and remove the occurrences of \(\beta \) to yield the sufficient condition

$$\begin{aligned} {\textbf {u }}\le \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}. \end{aligned}$$

\(\square \)

Matching the statement of Theorem 4.9, we give a slightly cleaner sufficient condition on N that implies the hypothesis on N appearing in the above lemmas. The proof is deferred to “Appendix A”.

Lemma 4.18

(Final sufficient condition on N) If

$$\begin{aligned} N = \lceil \lg (1/s) + 3 \lg \lg (1/s) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil , \end{aligned}$$

then

$$\begin{aligned} N \ge \lg (8/s) + 2 \lg \lg (8/s) + \lg \lg (16/(\beta s^2 \epsilon _0)) + 1.62. \end{aligned}$$

Taking the logarithm of the machine precision yields the number of bits required:

Lemma 4.19

(Bit length computation) Suppose

$$\begin{aligned} N = \lceil \lg (1/s) + 3 \lg \lg (1/s) + \lg \lg (1/(\beta \epsilon _0)) + 7.59 \rceil \end{aligned}$$

and

$$\begin{aligned} {\textbf {u }}_{\mathsf {SGN}} = \frac{ (1-s)^{2^{N+1}(c_\mathsf {INV}\log n + 3)}}{\mu _{\mathsf {INV}}(n) \sqrt{n} N}. \end{aligned}$$

Then

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) = O\big (\log n \log (1/s)^3 (\log (1/\beta ) + \log (1/\epsilon _0))\big ). \end{aligned}$$

Proof

In the course of the proof, for convenience we also record a nonasymptotic bound (for \(s<1/100\), \(\beta < 1/12\), \(\epsilon _0 < 1\) and \(c_\mathsf {INV}\log n > 1\) as in the hypothesis of Theorem 4.9), at the cost of making the computation somewhat messier.

Immediately we have

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) \le \lg \mu _{\mathsf {INV}}(n) + \frac{1}{2}\lg n + \lg N + (c_\mathsf {INV}\log n + 3) 2^{N+1} \log (1/(1-s)). \end{aligned}$$

Note that \(\log (1/(1-s)) < s\) for \(s < 1/2\). Also, \(2^{N+1} \le (1/s) \lg (1/s)^3 (\lg (1/\beta ) + \lg (1/\epsilon _0))2^{9.59}.\) Putting this together, we have

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}})&\le \lg \mu _{\mathsf {INV}}(n) + \frac{1}{2}\lg n + \lg N + 1000 (c_\mathsf {INV}\log n + 3)\lg (1/s)^3 (\lg (1/\beta ) \\&\quad + \lg (1/\epsilon _0)). \end{aligned}$$

We now crudely bound \(\lg N\). Note that for \(s < 1/100\) we have \(\lg (1/s) + 3 \lg \lg (1/s) + 7.59 \le 1/s\). Thus,

$$\begin{aligned} \lg N&\le \lg (1/s + \lg \lg (1/(\beta \epsilon _0)))&\\&\le \lg (1/s + \lg (1/(\beta \epsilon _0)))&\\&\le \lg (1/s) + \lg \lg (1/(\beta \epsilon _0))&\lg (a+b) \le \lg a + \lg b \text { for } a,b>2\\&\le \lg (1/s)^3 \lg (1/(\beta \epsilon _0)).&\end{aligned}$$

Combining the above, we may fold the \(\lg N\) and \(\lg n\) terms into the final term to obtain

$$\begin{aligned} \lg (1/{\textbf {u }}_{\mathsf {SGN}}) \le \lg \mu _{\mathsf {INV}}(n) + 5000c_\mathsf {INV}\log n \lg (1/s)^3 (\lg (1/\beta ) + \lg (1/\epsilon _0)) \end{aligned}$$
(41)

where we use that \(c_\mathsf {INV}\log n > 1\) and therefore \(c_\mathsf {INV}\log n + 3 < 4 c_\mathsf {INV}\log n.\)

Using that \(\mu _{\mathsf {INV}}(n) = \mathrm {poly}(n)\) and discarding subdominant terms, we obtain the desired asymptotic bound. \(\square \)

This completes the proof of Theorem 4.9. Finally, we may prove the theorem advertised in Sect. 1.

Proof of Theorem 1.5

Set \(\epsilon := \min \{ \frac{1}{K}, 1\}\). Then \(\Lambda _\epsilon (A)\) does not intersect the imaginary axis, and furthermore \(\Lambda _\epsilon (A) \subseteq D(0, 2)\) because \(\Vert A \Vert \le 1\). Thus, we may apply Lemma 4.10 with \({{\,\mathrm{diam}\,}}(\mathsf {g}) = 4\sqrt{2}\) to obtain parameters \(\alpha _0, \epsilon _0\) with the property that \(\log (1/(1-\alpha _0))\) and \(\log (1/\epsilon _0)\) are both \(O(\log K)\). Theorem 4.9 now yields the desired conclusion. \(\square \)

5 Spectral Bisection Algorithm

In this section we will prove Theorem 1.6. As discussed in Sect. 1, our algorithm is not new, and in its idealized form it reduces to the following two tasks:

Split:

Given an \(n\times n\) matrix A, find a partition of the spectrum into pieces of roughly equal size, and output spectral projectors \(P_{\pm }\) onto each of these pieces.

Deflate:

Given an \(n\times n\) rank-k projector P, output an \(n\times k\) matrix Q with orthonormal columns that span the range of P.

These routines in hand, on input A one can compute \(P_{\pm }\) and the corresponding \(Q_{\pm }\), and then find the eigenvectors and eigenvalues of \(A_{\pm } := Q_{\pm }^*A Q_{\pm }\). The observation below verifies that this recursion is sound.

Observation 5.1

The spectrum of A is exactly \(\Lambda (A_+) \sqcup \Lambda (A_-)\), and every eigenvector of A is of the form \(Q_{\pm }v\) for some eigenvector v of one of \(A_{\pm }\).
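In exact arithmetic, Observation 5.1 already yields a complete recursion. The sketch below (ours) illustrates it with exact-arithmetic stand-ins: a full eigendecomposition plays the role of Split (splitting along the median real part) and an ordinary QR factorization plays the role of Deflate.

```python
import numpy as np

def ideal_eig(A):
    """Idealized split/deflate recursion of Observation 5.1 (exact arithmetic).

    Stand-ins: spectral projectors come from a full eigendecomposition, and
    deflation is an ordinary QR factorization of the relevant eigenvectors.
    """
    n = A.shape[0]
    if n == 1:
        return [A[0, 0]]
    evals, W = np.linalg.eig(A)
    h = np.median(evals.real)          # a vertical splitting line
    mask = evals.real > h
    if mask.all() or not mask.any():   # cannot split further; stop here
        return list(evals)
    out = []
    for side in (mask, ~mask):
        # The range of the spectral projector onto this side is the span of
        # the corresponding eigenvectors; Q is an orthonormal basis for it.
        Q, _ = np.linalg.qr(W[:, side])
        out += ideal_eig(Q.conj().T @ A @ Q)   # recurse on the compression
    return out

A = np.random.default_rng(1).standard_normal((6, 6))
print(sorted(np.round(ideal_eig(A), 8), key=lambda z: (z.real, z.imag)))
print(sorted(np.round(np.linalg.eigvals(A), 8), key=lambda z: (z.real, z.imag)))
```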

The difficulty, of course, is that neither of these routines can be executed exactly: we will never have access to true projectors \(P_{\pm }\), nor to the actual orthogonal matrices \(Q_{\pm }\) whose columns span their range, and must instead make do with approximations. Because our algorithm is recursive and our matrices nonnormal, we must take care that the errors in the sub-instances \(A_{\pm }\) do not corrupt the eigenvectors and eigenvalues we are hoping to find. Additionally, the Newton iteration we will use to split the spectrum behaves poorly when an eigenvalue is close to the imaginary axis, and it is not clear how to find a splitting which is balanced.

Our tactic in resolving these issues will be to pass to our algorithms a matrix and a grid with respect to which its \(\epsilon \)-pseudospectrum is shattered. To find an approximate eigenvalue, then, one can settle for locating the grid square it lies in; containment in a grid square is robust to perturbations of size smaller than \(\epsilon \). The shattering property is robust to small perturbations, inherited by the subproblems we pass to, and—because the spectrum is quantifiably far from the grid lines—allows us to run the Newton iteration in the first place.

Let us now sketch the implementations and state carefully the guarantees for \(\mathsf {SPLIT}\) and \(\mathsf {DEFLATE}\); the analysis of these will be deferred to Appendices B and C. Our splitting algorithm is presented with a matrix A whose \(\epsilon \)-pseudospectrum is shattered with respect to a grid \(\mathsf {g}\). For any vertical grid line with real part h, \(\mathrm {Tr}\, \mathrm {sgn}(A-h)\) gives the difference between the number of eigenvalues lying to its right and to its left. As

$$\begin{aligned} |\mathrm {Tr}\,\mathsf {SGN}(A-h) - \mathrm {Tr}\, \mathrm {sgn}(A-h)| \le n\Vert \mathsf {SGN}(A-h) - \mathrm {sgn}(A-h)\Vert , \end{aligned}$$

we can determine these eigenvalue counts exactly by running \(\mathsf {SGN}\) to accuracy O(1/n) and rounding \(\mathrm {Tr}\, \mathsf {SGN}(A-h)\) to the nearest integer. We will show in “Appendix B” that, by mounting a binary search over horizontal and vertical lines of \(\mathsf {g}\), we will always arrive at a partition of the eigenvalues into two parts of size at least \(\max \{n/5,1\}\). Having found it, we run \(\mathsf {SGN}\) one final time at the desired precision to find the approximate spectral projectors.
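To make the counting step concrete, here is a small sketch (ours), with the Newton iteration standing in for \(\mathsf {SGN}\) and a fixed iteration count standing in for the O(1/n) accuracy requirement; it assumes the eigenvalues are comfortably far from the test line, as shattering guarantees.

```python
import numpy as np

def count_right_of(A, h, iters=25):
    """Number of eigenvalues of A with real part > h, via Tr sgn(A - h).

    Tr sgn(A - h) = (# eigenvalues right of the line) - (# left of it),
    so rounding the computed trace to the nearest integer recovers the
    counts exactly once sgn(A - h) is known to accuracy O(1/n).
    """
    n = A.shape[0]
    X = np.array(A, dtype=np.complex128) - h * np.eye(n)
    for _ in range(iters):                 # Newton stand-in for SGN
        X = 0.5 * (X + np.linalg.inv(X))
    diff = int(round(np.trace(X).real))    # (# right) minus (# left)
    return (n + diff) // 2

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8)) / np.sqrt(8)
# The two printed counts should agree.
print(count_right_of(A, 0.0), np.sum(np.linalg.eigvals(A).real > 0))
```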

[figure c: pseudocode for \(\mathsf {SPLIT}\)]

Theorem 5.2

(Guarantees for \(\mathsf {SPLIT}\)) Assume \(\mathsf {INV}\) is a \((\mu _\mathsf {INV},c_\mathsf {INV})\)-stable matrix inversion algorithm satisfying Definition 2.7. Let \(\epsilon \le 0.5\), \(\beta \le 0.05/n\), and \(\Vert A\Vert \le 4\), let \(\mathsf {g}\) have side lengths of at most 8, and define

$$\begin{aligned} N_{\mathsf {SPLIT}} := \lg \frac{256}{\epsilon } + 3\lg \lg \frac{256}{\epsilon } + \lg \lg \frac{4}{\beta \epsilon } + 7.59. \end{aligned}$$

Then \(\mathsf {SPLIT}\) has the advertised guarantees when run on a floating point machine with precision

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_\mathsf {SPLIT}:= \min \left\{ \frac{\left( 1 - \frac{\epsilon }{256}\right) ^{2^{N_{\mathsf {SPLIT}}+1} (c_{\mathsf {INV}} \log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {SPLIT}}},\, \frac{\epsilon }{100 n},\, \frac{\epsilon ^2}{512} \right\} , \end{aligned}$$

using at most

$$\begin{aligned} T_{\mathsf {SPLIT}}(n,\mathsf {g},\epsilon ,\beta ) \le 12 \lg \frac{1}{\omega (\mathsf {g})}\cdot N_{\mathsf {SPLIT}} \cdot \left( T_{\mathsf {INV}}(n) + O(n^2) \right) \end{aligned}$$

arithmetic operations. The number of bits required is

$$\begin{aligned} \lg 1/{\textbf {u }}_{\mathsf {SPLIT}} = O\left( \log n \log ^3 \frac{256}{\epsilon }\left( \log \frac{1}{\beta } + \log \frac{4}{\epsilon }\right) \right) . \end{aligned}$$

Deflation of the approximate projectors we obtain from \(\mathsf {SPLIT}\) amounts to a standard rank-revealing QR factorization. This can be achieved deterministically in \(O(n^3)\) time with the classic algorithm of Gu and Eisenstat [36], or probabilistically in matrix-multiplication time with a variant of the method of [22]; we will use the latter.
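As a minimal illustration of the randomized approach (ours; the actual \(\mathsf {RURV}\)-based routine and its finite-arithmetic analysis appear in “Appendix C”): right-multiplying the approximate projector by a Haar unitary makes an unpivoted QR factorization rank-revealing with high probability.

```python
import numpy as np

def deflate(P, k, rng):
    """Orthonormal basis for the range of an (approximate) rank-k projector.

    Right-multiplying by a Haar unitary makes the leading k columns of
    P @ V generically well-conditioned inside range(P), so an unpivoted
    QR reveals the rank with high probability.
    """
    n = P.shape[0]
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    V, _ = np.linalg.qr(G)      # Haar-distributed unitary (up to column phases)
    Q, _ = np.linalg.qr(P @ V)
    return Q[:, :k]

# Check on an exact (generally non-orthogonal) rank-2 spectral-style projector.
rng = np.random.default_rng(3)
W = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
P = W[:, :2] @ np.linalg.inv(W)[:2, :]     # satisfies P @ P = P, rank 2
Q = deflate(P, 2, rng)
print(np.linalg.norm(P - Q @ (Q.conj().T @ P)))  # ~ machine precision
```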

[figure d: pseudocode for \(\mathsf {DEFLATE}\)]

Theorem 5.3

(Guarantees for \(\mathsf {DEFLATE}\)) Assume \(\mathsf {MM}\) and \(\mathsf {QR}\) are matrix multiplication and QR factorization algorithms satisfying Definitions 2.6 and 2.8. Then \(\mathsf {DEFLATE}\) has the advertised guarantees when run on a machine with precision:

$$\begin{aligned} {\textbf {u }}\le {\textbf {u }}_\mathsf {DEFLATE}:= \min \left\{ \frac{\beta }{ 4\Vert {\widetilde{P}}\Vert \max (\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n))}, \frac{\eta }{2\mu _{\mathsf {QR}}(n)}\right\} . \end{aligned}$$

The number of arithmetic operations is at most:

$$\begin{aligned} T_{\mathsf {DEFLATE}}(n) = n^2 T_\mathsf {N}+ 2T_\mathsf {QR}(n)+T_\mathsf {MM}(n). \end{aligned}$$

Remark 5.4

The proof of the above theorem, which is deferred to “Appendix C”, closely follows and builds on the analysis of the randomized rank revealing factorization algorithm (\(\mathsf {RURV}\)) introduced in [22] and further studied in [7]. The parameters in the theorem are optimized for the particular application of finding a basis for a deflating subspace given an approximate spectral projector.

The main difference with the analysis in [22] and [7] is that here, to make it applicable to complex matrices, we make use of Haar unitary random matrices instead of Haar orthogonal random matrices. In our analysis of the unitary case, we discovered a strikingly simple formula (Corollary C.6) for the density of the smallest singular value of an \(r\times r\) sub-matrix of an \(n\times n\) Haar unitary; this formula is leveraged to obtain guarantees that work for any n and r, and not only for when \(n-r \ge 30\), as was the case in [7]. Finally, we explicitly account for finite arithmetic considerations in the Gaussian randomness used in the algorithm, since true Haar unitary matrices can never be produced in finite arithmetic.

We are ready now to state completely an algorithm \(\mathsf {EIG}\) which accepts a shattered matrix and grid and outputs approximate eigenvectors and eigenvalues with a forward-error guarantee. Aside from the a priori unmotivated parameter settings in lines 2 and 3—which we promise to justify in the analysis to come—\(\mathsf {EIG}\) implements an approximate version of the split and deflate framework that began this section.

[figure e: pseudocode for \(\mathsf {EIG}\)]

Theorem 5.5

(\(\mathsf {EIG}\): Finite Arithmetic Guarantee) Assume \(\mathsf {MM}, \mathsf {QR}\), and \(\mathsf {INV}\) are numerically stable algorithms for matrix multiplication, QR factorization, and inversion satisfying Definitions 2.6, 2.8, and 2.7. Let \(\delta < 1\), \(A \in \mathbb {C}^{n\times n}\) have \(\Vert A\Vert \le 3.5\) and, for some \(\epsilon < 1/2\), have \(\epsilon \)-pseudospectrum shattered with respect to a grid \(\mathsf {g}= \mathsf {grid}(z_0,\omega ,s_1,s_2)\) with side lengths at most 8 and \(\omega \le 1\). Define

$$\begin{aligned} N_{\mathsf {EIG}} := \lg \frac{256 n}{\epsilon } + 3\lg \lg \frac{256 n}{\epsilon } + \lg \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59. \end{aligned}$$

Then \(\mathsf {EIG}\) has the advertised guarantees when run on a floating point machine with precision satisfying:

$$\begin{aligned} \lg 1/{\textbf {u }}&\ge \max \left\{ \lg ^3 \frac{ n}{\epsilon }\lg \left( \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8} \right) 2^{9.59}(c_{\mathsf {INV}}\log n + 3) + \lg N_{\mathsf {EIG}}, \lg \frac{(5n)^{30}}{\theta ^2\delta ^4\epsilon ^8} \right. \\&\quad \left. + \lg \max \{\mu _{\mathsf {MM}}(n),\mu _{\mathsf {QR}}(n),n\} \right\} \\&= O\left( \log ^3 \frac{n}{\epsilon }\log \frac{n}{\theta \delta \epsilon }\log n\right) . \end{aligned}$$

The number of arithmetic operations is at most

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta , n)&= 60 N_{\mathsf {EIG}}\lg \frac{1}{\omega (\mathsf {g})}\left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 10 T_{\mathsf {QR}}(n) + 25 T_{\mathsf {MM}}(n) \\&= O\left( \log \frac{1}{\omega (\mathsf {g})}\left( \log \frac{n}{\epsilon } + \log \log \frac{1}{\theta \delta }\right) T_{\mathsf {MM}}(n) \right) . \end{aligned}$$

Remark 5.6

We have not fully optimized the large constant \(2^{9.59}\) appearing in the bit length above.

Theorem 5.5 easily implies Theorem 1.6 when combined with \(\mathsf {SHATTER}\).

Theorem 5.7

(Restatement of Theorem 1.6) There is a randomized algorithm \(\mathsf {EIG}\) which on input any matrix \(A\in \mathbb {C}^{n\times n}\) with \(\Vert A\Vert \le 1\) and a desired accuracy parameter \(\delta \in (0,1)\) outputs a diagonal D and invertible V such that

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \delta \quad \mathrm {and}\quad \kappa (V) \le 32n^{2.5}/\delta \end{aligned}$$

in

$$\begin{aligned} O\left( T_\mathsf {MM}(n)\log ^2\frac{n}{\delta }\right) \end{aligned}$$

arithmetic operations on a floating point machine with

$$\begin{aligned} O\left( \log ^4\frac{n}{\delta }\log n\right) \end{aligned}$$

bits of precision, with probability at least \(1-14/n\). Here \(T_\mathsf {MM}(n)\) refers to the running time of a numerically stable matrix multiplication algorithm (detailed in Sect. 2.5).

Proof

Given A and \(\delta \), consider the following two step algorithm:

1. \((X, \mathsf {g}, \epsilon )\leftarrow \mathsf {SHATTER}(A,\delta /8)\).

2. \((V,D)\leftarrow \mathsf {EIG}(X,\delta ',\mathsf {g},\epsilon ,1/n,n)\), where

    $$\begin{aligned} \delta ' := \frac{\delta ^3}{n^{4.5} \cdot 6 \cdot 128 \cdot 2}. \end{aligned}$$
    (42)

With probability at least \(1 - 13/n\), \(\mathsf {SHATTER}(A,\delta /8)\) succeeds, in which case the outputs \((X,\mathsf {g},\epsilon )\) easily satisfy the assumptions in Theorem 5.5: \(\delta ' \le \delta < 1\), \(\epsilon = \tfrac{(\delta /8)^5}{32 n^9} \le 1/2\), \(\mathsf {g}\) is defined by \(\mathsf {SHATTER}\) to have side length 8, \(\Vert X\Vert \le \Vert A\Vert + \Vert X - A\Vert \le 1 + 4(\delta /8) \le 3.5\), and X has \(\epsilon \)-pseudospectrum shattered with respect to \(\mathsf {g}\). On this event, \(X = WCW^{-1}\), and (using the proof of Theorem 3.6) if we normalize W to have unit length columns, then \(\kappa (W) = \Vert W\Vert \Vert W^{-1}\Vert \le 8n^2/\delta \).

We will show that the choice of \(\delta '\) in (42) guarantees

$$\begin{aligned} \Vert X-VDV^{-1}\Vert \le \delta /2. \end{aligned}$$

Since \(\Vert X\Vert \le \Vert A\Vert + \Vert A - X\Vert \le 1 + 4\gamma \le 3\) with \(\gamma =\delta /8\) from Theorem 3.13, the hypotheses of Theorem 5.5 are satisfied. Thus \(\mathsf {EIG}\) succeeds with probability at least \(1-1/n\), and by a union bound, both \(\mathsf {EIG}\) and \(\mathsf {SHATTER}\) succeed with probability at least \(1 - 14/n\). On this event, we have \(V=W+E\) for some \(\Vert E\Vert \le \delta '\sqrt{n}\), so

$$\begin{aligned} \Vert V-W\Vert \le \delta '\sqrt{n}, \end{aligned}$$

as well as

$$\begin{aligned} \sigma _n(V)\ge \sigma _n(W)-\Vert E\Vert \ge \frac{\delta }{8n^2}-\delta '\sqrt{n}\ge \frac{\delta }{16n^2}, \end{aligned}$$

since our choice of \(\delta '\) satisfies the much cruder bound of

$$\begin{aligned} \delta '\le \frac{\delta }{16n^{2.5}}. \end{aligned}$$

This implies that

$$\begin{aligned} \kappa (V)=\Vert V\Vert \Vert V^{-1}\Vert \le 2\sqrt{n}\cdot \frac{16n^2}{\delta }, \end{aligned}$$

establishing the last item of the theorem. We can control the perturbation of the inverse as:

$$\begin{aligned} \Vert V^{-1} - W^{-1} \Vert&= \Vert W^{-1} (W - V) V^{-1} \Vert \\&\le \kappa (W)\Vert W - V\Vert \Vert V^{-1}\Vert \\&\le \frac{8n^2}{\delta }\cdot \delta '\sqrt{n} \cdot \frac{16 n^2}{\delta }\\&\le \frac{128 n^{4.5}\delta '}{\delta ^2}. \end{aligned}$$

The grid output by \(\mathsf {SHATTER}(A,\delta /8)\) has \(\omega = \tfrac{\delta ^4}{4\cdot 8^4\cdot n^5} \le \tfrac{\delta }{\sqrt{2}}\) provided \(\delta < 1\). Thus the guarantees on \(\mathsf {EIG}\) in Theorem 5.5 tell us each eigenvalue of \(X = WCW^{-1}\) shares a grid square with exactly one diagonal entry of D, which means that \(\Vert C - D\Vert \le \sqrt{2}\omega \le \delta \). So, we have:

$$\begin{aligned} \Vert VDV^{-1}-WCW^{-1}\Vert&\le \Vert (V-W) D V^{-1} \Vert + \Vert W (D-C) V^{-1} \Vert \\&\quad + \Vert WC (V^{-1} - W^{-1}) \Vert \\&\le \delta ' \sqrt{n} \cdot 5 \cdot \frac{16n^2}{\delta } + \sqrt{n} \delta ' \frac{16 n^2}{\delta } + \sqrt{n} \cdot 5 \cdot \frac{128n^{4.5} \delta '}{\delta ^2} \\&= \frac{\delta ' n^{4.5}}{\delta } \left( 5 \cdot 16 + 16 + \frac{5 \cdot 128}{\delta } \right) \\&\le \frac{\delta ' n^{4.5}}{\delta ^2} \cdot 6\cdot 128 \end{aligned}$$

which is at most \(\delta /2\), for \(\delta '\) chosen as above. We conclude that

$$\begin{aligned} \Vert A-VDV^{-1}\Vert \le \Vert A-X\Vert +\Vert X-VDV^{-1}\Vert \le \delta , \end{aligned}$$

with probability \(1-14/n\) as desired.

To compute the running time and precision, we observe that \(\mathsf {SHATTER}\) outputs a grid with parameters

$$\begin{aligned} \omega = \Omega \left( \frac{\delta ^4}{n^5}\right) ,\quad \epsilon =\Omega \left( \frac{\delta ^5}{n^9}\right) . \end{aligned}$$

Plugging this into the guarantees of \(\mathsf {EIG}\), we see that it takes

$$\begin{aligned} O\left( \log \frac{n}{\delta }\left( \log \frac{n}{\delta } + \log \log \frac{n}{\delta }\right) T_{\mathsf {MM}}(n) \right) = O(T_\mathsf {MM}(n)\log ^2(n/\delta )) \end{aligned}$$

arithmetic operations, on a floating point machine with precision

$$\begin{aligned} O\left( \log ^3 \frac{n}{\delta }\log \frac{n}{\delta }\log n\right) = O(\log ^4(n/\delta )\log (n)) \end{aligned}$$

bits, as advertised. \(\square \)
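The two-step structure of this proof is easy to simulate with off-the-shelf tools. The sketch below (ours) uses a small complex Ginibre perturbation in the role of \(\mathsf {SHATTER}\) and numpy's eigensolver in the role of \(\mathsf {EIG}\), then measures the backward error \(\Vert A-VDV^{-1}\Vert \) and \(\kappa (V)\) controlled by Theorem 5.7; the scalings here are illustrative rather than the theorem's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta = 20, 1e-3

# A maximally non-normal input: a Jordan block (one defective eigenvalue).
A = np.diag(np.ones(n - 1), 1)

# Role of SHATTER: add gamma * (complex Ginibre), with gamma = delta / 8.
gamma = delta / 8
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
X = A + gamma * G

# Role of EIG: diagonalize the regularized matrix.
evals, V = np.linalg.eig(X)
err = np.linalg.norm(A - V @ np.diag(evals) @ np.linalg.inv(V), 2)
print(err <= delta, np.linalg.cond(V))  # expect: True, and a modest kappa(V)
```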

5.1 Proof of Theorem 5.5

A key stepping-stone in our proof will be the following elementary result controlling the spectrum, pseudospectrum, and eigenvectors after perturbing a shattered matrix.

Lemma 5.8

(Eigenvector Perturbation for a Shattered Matrix) Let \(\Lambda _{\epsilon }(A)\) be shattered with respect to a grid whose squares have side length \(\omega \), and assume that \(\Vert {{\widetilde{A}}} - A\Vert \le \eta < \epsilon \). Then, (i) each eigenvalue of \({{\widetilde{A}}}\) lies in the same grid square as exactly one eigenvalue of A, (ii) \(\Lambda _{\epsilon - \eta }(\widetilde{A})\) is shattered with respect to the same grid, and (iii) for any right unit eigenvector \({{\widetilde{v}}}\) of \({\widetilde{A}}\), there exists a right unit eigenvector v of A corresponding to the same grid square, and for which

$$\begin{aligned} \Vert {{\widetilde{v}}} - v\Vert \le \frac{\sqrt{8}\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Proof

For (i), consider \(A_t = A + t({{\widetilde{A}}} - A)\) for \(t \in [0,1]\). By continuity, the entire trajectory of each eigenvalue is contained in a unique connected component of \(\Lambda _\eta (A) \subset \Lambda _\epsilon (A)\). For (ii), \(\Lambda _{\epsilon - \eta }({{\widetilde{A}}}) \subset \Lambda _{\epsilon }(A)\), which is shattered by hypothesis. Finally, for (iii), let \(w^*\) and \({\widetilde{w}}^*\) be the corresponding left eigenvectors to v and \({\widetilde{v}}\), respectively, normalized so that \(w^*v = {\widetilde{w}}^*{\widetilde{v}} = 1\). Let \(\Gamma \) be the boundary of the grid square containing the eigenvalues associated to v and \({\widetilde{v}}\), respectively. Then, using a contour integral along \(\Gamma \) as in (13) above, one gets

$$\begin{aligned} \Vert {\widetilde{v}}{\widetilde{w}}^*- vw^*\Vert \le \frac{2\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Thus, using that \(\Vert v\Vert =1\) and \(w^*v = 1\),

$$\begin{aligned} \Vert {\widetilde{v}}{\widetilde{w}}^*-vw^*\Vert \ge \Vert ({\widetilde{v}}{\widetilde{w}}^*-vw^*) v\Vert = \Vert ({\widetilde{w}}^* v) {\widetilde{v}}-v\Vert . \end{aligned}$$

Now, since \(({\widetilde{v}}^*v) {\widetilde{v}}\) is the orthogonal projection of v onto the span of \({\widetilde{v}}\), we have that

$$\begin{aligned} \Vert ({\widetilde{w}}^*v){\widetilde{v}}- v\Vert \ge \Vert ({\widetilde{v}}^*v) {\widetilde{v}}- v\Vert = \sqrt{1-|{\widetilde{v}}^*v|^2}. \end{aligned}$$

Multiplying v by a phase we can assume without loss of generality that \({\widetilde{v}}^* v\ge 0\) which implies that

$$\begin{aligned} \sqrt{1-({\widetilde{v}}^*v)^2} = \sqrt{(1-{\widetilde{v}}^*v)(1+{\widetilde{v}}^*v)} \ge \sqrt{1-{\widetilde{v}}^*v}. \end{aligned}$$

The above discussion can now be summarized in the following chain of inequalities

$$\begin{aligned} \sqrt{1-{\widetilde{v}}^*v}\le \sqrt{1-({\widetilde{v}}^*v)^2}\le \Vert ({\widetilde{w}}^*v){\widetilde{v}}- v\Vert \le \Vert {\widetilde{v}}{\widetilde{w}}^*-vw^*\Vert \le \frac{2\omega }{\pi }\frac{\eta }{\epsilon (\epsilon - \eta )}. \end{aligned}$$

Finally, note that \(\Vert v-{\widetilde{v}}\Vert = \sqrt{2-2{\widetilde{v}}^*v} \le \frac{\sqrt{8}\omega }{\pi } \frac{\eta }{\epsilon (\epsilon - \eta )}\) as we wanted to show. \(\square \)

The algorithm \(\mathsf {EIG}\) works by recursively reducing to subinstances of smaller size, but requires a pseudospectral guarantee to ensure speed and stability. We thus need to verify that the pseudospectrum does not deteriorate too substantially when we pass to a sub-problem.

Lemma 5.9

(Shattering is preserved after compression) Suppose P is a spectral projector of \(A\in \mathbb {C}^{n\times n}\) of rank k. Let \(Q\in \mathbb {C}^{n\times k}\) be such that \(Q^*Q=I_k\) and that its columns span the same space as the columns of P. Then for every \(\epsilon >0\),

$$\begin{aligned} \Lambda _\epsilon (Q^*AQ)\subset \Lambda _\epsilon (A). \end{aligned}$$

Alternatively, the same pseudospectral inclusion holds if again \(Q^*Q=I_k\) and, instead, the columns of Q span the same space as the rows of P.

Proof

We will first analyze the case when the columns of Q span the same space as the columns of P. To begin, note that if \(z\in \Lambda _\epsilon (Q^*AQ)\) then there exists \(v\in \mathbb {C}^k\) satisfying \(\Vert (z-Q^*AQ)v\Vert \le \epsilon \Vert v\Vert \). Since \(I_k=Q^*I_nQ\) we have

$$\begin{aligned} \Vert Q^*(z-A)Qv\Vert \le \epsilon \Vert v\Vert . \end{aligned}$$

And, because \(Q^*\) acts as an isometry on \(\mathrm {range}(Q)\) (the span of the columns of Q), and this space, being the range of the spectral projector P, is invariant under A (and hence under \((z-A)\)), we have that \((z-A)Qv\in \mathrm {range}(Q)\), and therefore \(\Vert Q^*(z-A)Qv\Vert = \Vert (z-A)Qv\Vert \). From this we obtain

$$\begin{aligned} \Vert (z-A)Qv\Vert \le \epsilon \Vert v\Vert =\epsilon \Vert Qv\Vert , \end{aligned}$$

showing that \(z\in \Lambda _\epsilon (A)\).

For the case in which the columns of Q span the rows of P, the above proof can be easily modified by now taking v with the property that \(\Vert v^* Q^* (z-A)Q\Vert \le \epsilon \Vert v\Vert \). \(\square \)
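Numerically, the inclusion of Lemma 5.9 is the pointwise inequality \(\sigma _{\min }(z-A)\le \sigma _{\min }(z-Q^*AQ)\) for all \(z\in \mathbb {C}\). The sketch below (ours) checks this at random sample points, with Q spanning an exact invariant subspace.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 8, 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

evals, W = np.linalg.eig(A)
Q, _ = np.linalg.qr(W[:, :k])   # orthonormal basis for an invariant subspace
B = Q.conj().T @ A @ Q          # the compression Q*AQ

smin = lambda M: np.linalg.svd(M, compute_uv=False)[-1]
zs = 2 * (rng.standard_normal(20) + 1j * rng.standard_normal(20))
for z in zs:
    # Lambda_eps(Q*AQ) inside Lambda_eps(A)  <=>  smin(z-A) <= smin(z-B).
    assert smin(z * np.eye(n) - A) <= smin(z * np.eye(k) - B) + 1e-9
print("pseudospectral inclusion holds at all sampled points")
```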

Observation 5.10

Since \(\delta ,\omega (\mathsf {g}),\epsilon \le 1\), our assumption on \(\eta \) in Line 2 of the pseudocode of \(\mathsf {EIG}\) implies the following bounds on \(\eta \) which we will use below:

$$\begin{aligned} \eta \le \min \left\{ 0.02,\epsilon /75,\delta /100,\frac{\delta \epsilon ^2}{200\omega (\mathsf {g})}\right\} . \end{aligned}$$

Initial lemmas in hand, let us begin to analyze the algorithm. At several points we will make an assumption on the machine precision, recorded in the margin next to the corresponding inequality. These will be collected at the end of the proof, where we will verify that they follow from the precision hypothesis of Theorem 5.5.

Correctness.

Lemma 5.11

(Accuracy of \(\widetilde{\lambda _i}\)) When \(\mathsf {DEFLATE}\) succeeds, each eigenvalue of A shares a square of \(\mathsf {g}\) with a unique eigenvalue of either \(\widetilde{A_{+}}\) or \(\widetilde{A_{-}}\), and furthermore \(\Lambda _{4\epsilon /5} (\widetilde{A_{\pm }}) \subset \Lambda _\epsilon (A)\).

Proof

Let \(P_{\pm }\) be the true projectors onto the two bisection regions found by \(\mathsf {SPLIT}(A,\beta )\), \(Q_{\pm }\) be the matrices whose orthonormal columns span their ranges, and \(A_{\pm } := Q_{\pm }^*A Q_{\pm }\). From Theorem 5.3, on the event that \(\mathsf {DEFLATE}\) succeeds, the approximation \(\widetilde{Q_{\pm }}\) that it outputs satisfies \(\Vert \widetilde{Q_{\pm }} - Q_{\pm }\Vert \le \eta \), so in particular \(\Vert \widetilde{Q_{\pm }}\Vert \le 2\) as \(\eta \le 1\). The error \(E_{6,\pm }\) from performing the matrix multiplications necessary to compute \(\widetilde{A_{\pm }}\) admits the bound

$$\begin{aligned} \Vert E_{6,\pm }\Vert&\le \mu _{\mathsf {MM}}(n)\Vert \widetilde{Q_{\pm }}\Vert \Vert A\widetilde{Q_{\pm }}\Vert {\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2\Vert \widetilde{Q_{\pm }}A\Vert {\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2\Vert \widetilde{Q_{\pm }}\Vert ^2\Vert A\Vert {\textbf {u }}\\&\le 16\left( \mu _{\mathsf {MM}}(n){\textbf {u }}+ \mu _{\mathsf {MM}}(n)^2{\textbf {u }}^2\right) \quad \quad \quad \Vert A\Vert \le 4 \text { and } \Vert \widetilde{Q_{\pm }}\Vert \le 1 + \eta \le 1.02 \le \sqrt{2} \\&\le 3\eta \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad {\textbf {u }}\le \frac{\eta }{10 \mu _{\mathsf {MM}}(n)^2}. \end{aligned}$$

Iterating the triangle inequality, we obtain

$$\begin{aligned} \Vert \widetilde{A_{\pm }} - A_{\pm }\Vert&\le \Vert E_{6,\pm }\Vert + \Vert (\widetilde{Q_{\pm }} - Q_{\pm })A\widetilde{Q_{\pm }}\Vert + \Vert Q_{\pm }A(\widetilde{Q_{\pm }} - Q_{\pm })\Vert \\&\le 3\eta + 8\eta + 4\eta \quad \quad \quad \Vert \widetilde{Q_{\pm }} - Q_{\pm }\Vert \le \eta \\&\le \epsilon /5 \quad \quad \quad \quad \quad \quad \quad \eta \le \epsilon /75. \end{aligned}$$

We can now apply Lemma 5.8. \(\square \)

Everything is now in place to show that, if every call to \(\mathsf {DEFLATE}\) succeeds, \(\mathsf {EIG}\) has the advertised accuracy guarantees. After we show this, we will lower bound this success probability and compute the running time.

When \(A \in \mathbb {C}^{1\times 1}\), the algorithm works as promised. Assume inductively that \(\mathsf {EIG}\) has the desired guarantees on instances of size strictly smaller than n. In particular, maintaining the notation from the above lemmas, we may assume that

$$\begin{aligned} ({\widetilde{V}}_{\pm }, \widetilde{D_{\pm }}) = \mathsf {EIG}(\widetilde{A_{\pm }},4\delta /5,\mathsf {g}_{\pm },4\epsilon /5,\theta ,n) \end{aligned}$$

satisfy (i) each eigenvalue of \(\widetilde{D_{\pm }}\) shares a square of \(\mathsf {g}_{\pm }\) with exactly one eigenvalue of \(\widetilde{A_{\pm }}\), and (ii) each column of \(\widetilde{V_{\pm }}\) is \(4\delta /5\)-close to a true eigenvector of \(\widetilde{A_{\pm }}\). From Lemma 5.8, each eigenvalue of \(\widetilde{A_{\pm }}\) shares a grid square with exactly one eigenvalue of A, and thus the output

$$\begin{aligned} {\widetilde{D}} = \begin{pmatrix} \widetilde{D_+} &{} \\ &{} \widetilde{D_-} \end{pmatrix} \end{aligned}$$

satisfies the eigenvalue guarantee.

To verify that the computed eigenvectors are close to the true ones, let \(\widetilde{{\widetilde{v}}_{\pm }}\) be some approximate right unit eigenvector of one of \(\widetilde{A_{\pm }}\) output by \(\mathsf {EIG}\) (with norm \(1 \pm n{\textbf {u }}\)), \({\widetilde{v}}_{\pm }\) the exact unit eigenvector of \(\widetilde{A_\pm }\) that it approximates, and \(v_{\pm }\) the corresponding exact unit eigenvector of \(A_{\pm }\). Recursively, \(\mathsf {EIG}(A,\epsilon ,\mathsf {g},\delta ,\theta ,n)\) will output an approximate unit eigenvector

$$\begin{aligned} {\widetilde{v}} := \frac{\widetilde{Q_{\pm }} \widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }} \widetilde{{\widetilde{v}}_{\pm }} + e\Vert } + e', \end{aligned}$$

whose proximity to the actual eigenvector \(v := Q v_{\pm }\) we need now to quantify. The error terms here are e, a column of the error matrix \(E_{8}\) whose norm we can crudely bound by

$$\begin{aligned} \Vert e\Vert \le \Vert E_{8}\Vert \le \mu _{\mathsf {MM}}(n) \Vert \widetilde{Q_{\pm }}\Vert \Vert \widetilde{V_{\pm }}\Vert {\textbf {u }}\le 4\mu _{\mathsf {MM}}(n){\textbf {u }}\le \eta , \end{aligned}$$

and \(e'\), a column of \(E_9\), incurred by performing the normalization in floating point; in our initial discussion of floating point arithmetic we assumed in (16) that \(\Vert e'\Vert \le n{\textbf {u }}\).

First, since \({\widetilde{v}} - e'\) and \(\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\) are parallel, the distance between them is just the difference in their norms:

$$\begin{aligned} \left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } - \left( \widetilde{Q_\pm }\widetilde{{\widetilde{v}}_{\pm }} + e\right) \right\|&\le \left| \Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert - 1\right| \le (1 + \eta )(1 + {\textbf {u }}) \\&\quad +\, 4\mu _{\mathsf {MM}}{\textbf {u }}- 1 \le 4\eta . \end{aligned}$$

Inductively \(\Vert \widetilde{{\widetilde{v}}_{\pm }} - {\widetilde{v}}_{\pm } \Vert \le 4\delta /5\), and since \(\Vert A_{\pm } - \widetilde{A_{\pm }}\Vert \le \epsilon /5\) and \(A_{\pm }\) has shattered \(\epsilon \)-pseudospectrum from Lemma 5.9, Lemma 5.8 ensures

$$\begin{aligned} \Vert {\widetilde{v}}_{\pm } - v_{\pm }\Vert&\le \frac{\sqrt{8}\omega (\mathsf {g}) \cdot 15\eta }{\pi \cdot \epsilon (\epsilon - 15\eta )} \\&\le \frac{\sqrt{8}\omega (\mathsf {g})\cdot 15\eta }{\pi \cdot 4\epsilon ^2/5}&\eta \le \epsilon /75 \\&\le \delta /10&\eta \le \frac{\delta \epsilon ^2}{200 \omega (\mathsf {g})}. \end{aligned}$$

Thus, putting together the above, iterating the triangle inequality, and using \(\Vert Q_{\pm }\Vert = 1\),

$$\begin{aligned} \left\| {\widetilde{v}} - v \right\|&=\left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } + e' - Q_{\pm }v_{\pm } \right\| \\&\le \left\| \frac{\widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e}{\Vert \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\Vert } - \left( \widetilde{Q_{\pm }}\widetilde{{\widetilde{v}}_{\pm }} + e\right) \right\| + \Vert e'\Vert + \Vert e\Vert + \Vert (\widetilde{Q_{\pm }} - Q_{\pm })\widetilde{{\widetilde{v}}_{\pm }}\Vert \\&\quad + \Vert Q_{\pm }(\widetilde{{\widetilde{v}}_{\pm }} - {\widetilde{v}}_{\pm })\Vert + \Vert Q_{\pm }({\widetilde{v}}_{\pm } - v_{\pm })\Vert \\&\le 4\eta + n{\textbf {u }}+ \mu _{\mathsf {MM}}(n){\textbf {u }}+ \eta (1 + n{\textbf {u }}) + 4\delta /5 + \delta /10&n{\textbf {u }},\ \mu _{\mathsf {MM}}(n) {\textbf {u }}\le \eta \\&\le 8\eta + 4\delta /5 + \delta /10 \le \delta&\eta \le \delta /200. \end{aligned}$$

This concludes the proof of correctness of \(\mathsf {EIG}\).

Running Time and Failure Probability. Let us begin with a simple lemma bounding the depth of \(\mathsf {EIG}\)’s recursion tree.

Lemma 5.12

(Recursion Depth) The recursion tree of \(\mathsf {EIG}\) has depth at most \(\log _{5/4} n\), and every branch ends with an instance of size \(1\times 1\).

Proof

By Theorem 5.2, \(\mathsf {SPLIT}\) can always find a bisection of the spectrum into two regions containing \(n_\pm \) eigenvalues, respectively, with \(n_+ + n_- = n\) and \(n_{\pm } \le 4n/5\), and when \(n\le 5\) can always peel off at least one eigenvalue. Thus the depth d(n) satisfies

$$\begin{aligned} d(n) = {\left\{ \begin{array}{ll} n &{} n\le 5 \\ 1 + \max _{\theta \in [1/5,4/5]} d(\theta n) &{} n > 5 \end{array}\right. } \end{aligned}$$
(43)

As \(n \le \log _{5/4}n\) for \(n \le 5\), the result is immediate from induction. \(\square \)
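The recurrence can also be unrolled mechanically; the snippet below (ours) takes the worst allowed split \(n_+ = \lfloor 4n/5 \rfloor \) at every level and compares against \(\log _{5/4} n\).

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def depth(n):
    # Recurrence (43); with integer part sizes, floor(4n/5) is the worst split.
    return n if n <= 5 else 1 + depth(4 * n // 5)

for n in (10, 100, 10_000):
    print(n, depth(n), round(math.log(n, 5 / 4), 1))  # depth <= log_{5/4} n
```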

We pause briefly to verify that the assumptions \(\delta < 1\), \(\epsilon < 1/2\), \(\mathsf {g}\) has side lengths at most 8, and \(\Vert A\Vert \le 3.5\) in Theorem 5.5 ensure that every call to \(\mathsf {SPLIT}\) throughout the algorithm satisfies the hypotheses of Theorem 5.2, namely that \(\epsilon \le 0.5, \beta \le 0.05/n, \Vert A\Vert \le 4,\) and \(\mathsf {g}\) has side lengths of at most 8. Since \(\delta ,\epsilon ,\) and \(\beta \) are non-increasing as we travel down the recursion tree of \(\mathsf {EIG}\)—with \(\beta \) monotonically decreasing in \(\delta \) and \(\epsilon \)—we need only verify that the hypotheses of Theorem 5.2 hold on the initial call to \(\mathsf {EIG}\). The condition on \(\epsilon \) is immediately satisfied; for the one on \(\beta \), we have

$$\begin{aligned} \beta = \frac{\eta ^4 \theta ^2}{(20 n)^6 \cdot 4 n^8} = \frac{\theta ^2\delta ^4 \epsilon ^8}{200^4 (20 n)^6 \cdot 4 n^8}, \end{aligned}$$

which is clearly at most 0.05/n.

On each new call to \(\mathsf {EIG}\) the grid only decreases in size, so the initial assumption is sufficient. Finally, we need that every matrix passed to \(\mathsf {SPLIT}\) throughout the course of the algorithm has norm at most 4. Lemma 5.11 shows that if \(\Vert A\Vert \le 4\) and the \(\epsilon \)-pseudospectrum of A is shattered, then \(\Vert \widetilde{A_{\pm }} - A_{\pm }\Vert \le \epsilon /5\), and since \(\Vert A_{\pm }\Vert \le \Vert A\Vert \), this means \(\Vert \widetilde{A_{\pm }}\Vert \le \Vert A\Vert + \epsilon /5\). Thus each time we pass to a subproblem, the norm of the matrix we pass to \(\mathsf {EIG}\) (and thus to \(\mathsf {SPLIT}\)) increases by at most an additive \(\epsilon /5\), where \(\epsilon \) is the input to the outermost call to \(\mathsf {EIG}\). Since \(\epsilon \) decreases by a factor of 4/5 on each recursion step, this means that by the end of the algorithm the norm of the matrix passed to \(\mathsf {EIG}\) will increase by at most an additive \((\epsilon + (4/5)\epsilon + (4/5)^2 \epsilon + \cdots )/5 = \epsilon \le 1/2\). Thus we will be safe if our initial matrix has norm at most 3.5, as assumed.

Lemma 5.13

(Lower Bounds on the Parameters) Assume \(\mathsf {EIG}\) is run on an \(n\times n\) matrix, with some parameters \(\delta \) and \(\epsilon \). Throughout the algorithm, on every recursive call to \(\mathsf {EIG}\), the corresponding parameters \(\delta '\) and \(\epsilon '\) satisfy

$$\begin{aligned} \delta ' \ge \delta /n \qquad \epsilon ' \ge \epsilon /n. \end{aligned}$$

On each such call to \(\mathsf {EIG}\), the parameters \(\eta '\) and \(\beta '\) passed to \(\mathsf {SPLIT}\) and \(\mathsf {DEFLATE}\) satisfy

$$\begin{aligned} \eta ' \ge \frac{\delta \epsilon ^2}{200n^3} \qquad \beta ' \ge \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{26}}. \end{aligned}$$

Proof

Along each branch of the recursion tree, we replace \(\epsilon \leftarrow 4\epsilon /5\) and \(\delta \leftarrow 4\delta /5\) at most \(\log _{5/4}n\) times, so each can only decrease by a factor of n from their initial settings. The parameters \(\eta '\) and \(\beta '\) are computed directly from \(\epsilon '\) and \(\delta '\). \(\square \)

Lemma 5.14

(Failure Probability) \(\mathsf {EIG}\) fails with probability no more than \(\theta \).

Proof

Since each recursion splits into at most two subproblems, and the recursion tree has depth \(\log _{5/4}n\), there are at most

$$\begin{aligned} 2\cdot 2^{\log _{5/4}n} = 2n^{\frac{\log 2}{\log 5/4}} \le 2n^4 \end{aligned}$$

calls to \(\mathsf {DEFLATE}\). We have set every \(\eta \) and \(\beta \) so that the failure probability of each is \(\theta /2n^4\), so a crude union bound finishes the proof. \(\square \)

The arithmetic operations required for \(\mathsf {EIG}\) satisfy the recursive relationship

$$\begin{aligned}&T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n) \le T_{\mathsf {SPLIT}}(n,\epsilon ,\beta ) + T_{\mathsf {DEFLATE}}(n,\beta ,\eta ) + 2 T_{\mathsf {MM}}(n) \\&\qquad + T_{\mathsf {EIG}}(n_+,4\delta /5,\mathsf {g}_{+},4\epsilon /5,\theta ,n) + T_{\mathsf {EIG}}(n_-,4\delta /5,\mathsf {g}_{-},4\epsilon /5,\theta ,n) \\&\qquad + 2T_{\mathsf {MM}}(n) + O(n^2). \end{aligned}$$

All of \(T_{\mathsf {SPLIT}}\), \(T_{\mathsf {DEFLATE}}\), and \(T_{\mathsf {MM}}\) are of the form \(\mathrm {polylog}(n)\mathrm {poly}(n)\), with all coefficients nonnegative and exponents in the \(\mathrm {poly}(n)\) no smaller than 2. So, for any \(n_+ + n_- = n\) and \(n_{\pm } \le 4 n/5\), holding all other parameters fixed, \(T_{\mathsf {SPLIT}}(n_+,...) + T_{\mathsf {SPLIT}}(n_-,...) \le \left( (4/5)^2 + (1/5)^2\right) T_{\mathsf {SPLIT}}(n,...) = (17/25)T_{\mathsf {SPLIT}}(n,...)\) and the same holds for \(T_{\mathsf {DEFLATE}}\) and \(T_{\mathsf {MM}}\). Applying this recursively, with all parameters other than n set to their lower bounds from Lemma 5.13, we then have

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n)&\le \frac{1}{1 - 17/25}\left( T_{\mathsf {SPLIT}}\left( n,\epsilon /n,\mathsf {g},\frac{\delta ^4\epsilon ^8 \theta ^2}{(5n)^{26}}\right) \right. \\&\quad + \left. T_{\mathsf {DEFLATE}}\left( n,\beta /n,\epsilon /n,\frac{\delta ^4\epsilon ^8 \theta ^2}{(5n)^{26}}\right) + 4T_{\mathsf {MM}}(n) + O(n^2 ) \right) \\&= \frac{25}{8}\left( 12 N_{\mathsf {EIG}} \lg \frac{1}{\omega (\mathsf {g})} \left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 2 T_{\mathsf {QR}}(n) \right. \\&\quad + 5 T_{\mathsf {MM}}(n) + n^2 T_{\mathsf {N}} + O(n^2)\bigg ) \\&\le 60 N_{\mathsf {EIG}}\lg \frac{1}{\omega (\mathsf {g})}\left( T_{\mathsf {INV}}(n) + O(n^2)\right) + 10 T_{\mathsf {QR}}(n) + 25 T_{\mathsf {MM}}(n), \end{aligned}$$

where

$$\begin{aligned} N_{\mathsf {EIG}} := \lg \frac{256 n}{\epsilon } + 3\lg \lg \frac{256 n}{\epsilon } + \lg \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59. \end{aligned}$$

In the above inequalities, we have substituted in the expressions for \(T_{\mathsf {SPLIT}}\) and \(T_{\mathsf {DEFLATE}}\) from Theorems 5.2 and 5.3, respectively; \(N_{\mathsf {EIG}}\) is defined by recomputing \(N_{\mathsf {SPLIT}}\) with the parameter lower bounds, and the \(\epsilon ^9\) is not an error. The final inequality uses our assumption \(T_\mathsf {N}= O(1)\). Thus using the fast and stable instantiations of \(\mathsf {MM}\), \(\mathsf {INV}\), and \(\mathsf {QR}\) from Theorem 2.10, we have

$$\begin{aligned} T_{\mathsf {EIG}}(n,\delta ,\mathsf {g},\epsilon ,\theta ,n) = O\left( \log \frac{1}{\omega (\mathsf {g})}\left( \log \frac{n}{\epsilon } + \log \log \frac{1}{\theta \delta }\right) T_{\mathsf {MM}}(n,{\textbf {u }})\right) ; \end{aligned}$$
(44)

exact constants can be extracted by analyzing \(N_{\mathsf {EIG}}\) and unpacking Theorem 2.10.

Required Bits of Precision. We will need the following bound on the norms of all spectral projectors.

Lemma 5.15

(Sizes of Spectral Projectors) Throughout the algorithm, every approximate spectral projector \({\widetilde{P}}\) given to \(\mathsf {DEFLATE}\) satisfies \(\Vert {\widetilde{P}}\Vert \le 10n/\epsilon \).

Proof

Every such \({\widetilde{P}}\) is \(\beta \)-close to a true spectral projector P of a matrix whose \(\epsilon /n\)-pseudospectrum is shattered with respect to the initial \(8\times 8\) unit grid \(\mathsf {g}\). Since we can generate P by a contour integral around the boundary of a rectangular subgrid, we have

$$\begin{aligned} \Vert {\widetilde{P}}\Vert \le 2 + \Vert P\Vert \le 2 + \frac{32}{2\pi }\frac{n}{\epsilon } \le 10n/\epsilon , \end{aligned}$$

with the last inequality following from \(\epsilon < 1\). \(\square \)
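The contour-integral representation invoked in this proof is straightforward to realize numerically. The sketch below (ours) approximates \(P = \frac{1}{2\pi i}\oint _\Gamma (z-A)^{-1}\,dz\) over the boundary of a rectangle by a simple one-sided quadrature rule; the test matrix and point count are illustrative, and the result is accurate only up to quadrature error.

```python
import numpy as np

def contour_projector(A, ll, ur, pts_per_side=400):
    """Spectral projector onto the eigenvalues inside a rectangle, via
    P = (1 / (2 pi i)) * (integral of the resolvent over the boundary).

    ll, ur: lower-left and upper-right corners, as complex numbers.
    """
    n = A.shape[0]
    corners = [ll, complex(ur.real, ll.imag), ur, complex(ll.real, ur.imag), ll]
    P = np.zeros((n, n), dtype=complex)
    for a, b in zip(corners, corners[1:]):
        for t in np.arange(pts_per_side) / pts_per_side:
            z = a + t * (b - a)
            P += (b - a) / pts_per_side * np.linalg.inv(z * np.eye(n) - A)
    return P / (2j * np.pi)

rng = np.random.default_rng(6)
A = np.diag([0.5 + 0.5j, 0.3 + 0.6j, -1.0, 2.0]) + 0.02 * rng.standard_normal((4, 4))
P = contour_projector(A, 0.0 + 0.0j, 1.0 + 1.0j)
# Two eigenvalues lie in [0,1] x [0,1]: trace ~= 2 and P @ P ~= P,
# both up to quadrature error.
print(round(np.trace(P).real), np.linalg.norm(P @ P - P))
```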

Collecting the machine precision requirements \({\textbf {u }}\le {\textbf {u }}_{\mathsf {SPLIT}},{\textbf {u }}_{\mathsf {DEFLATE}}\) from Theorems 5.2 and 5.3, as well as those we used in the course of our proof so far, and substituting in the parameter lower bounds from Lemma 5.13, we need \({\textbf {u }}\) to satisfy

$$\begin{aligned} {\textbf {u }}&\le \min \left\{ \frac{\left( 1 - \frac{\epsilon }{256n}\right) ^{2^{N_{\mathsf {EIG}} + 1}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}, \right. \\&\qquad \left. \frac{\epsilon }{100 n^2}, \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{26}}\frac{1}{4\Vert {\widetilde{P}}\Vert \max \{\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n)\}}, \right. \\&\qquad \left. \frac{\delta \epsilon ^2}{100 n^3 \cdot 2\mu _{\mathsf {QR}}(n)}, \frac{\delta \epsilon ^2}{100 n^3 \max \{ 4\mu _{\mathsf {MM}}(n),n, 2\mu _{\mathsf {QR}}(n)\}} \right\} \end{aligned}$$

From Lemma 5.15, \(\Vert {\widetilde{P}}\Vert \le 10n/\epsilon \), so the conditions in the last two lines are all satisfied if we make the crude upper bound

$$\begin{aligned} {\textbf {u }}\le \frac{\theta ^2\delta ^4\epsilon ^8}{(5n)^{30}}\frac{1}{\max \{\mu _{\mathsf {QR}}(n),\mu _{\mathsf {MM}}(n),n\}}, \end{aligned}$$
(45)

i.e. if \(\lg 1/{\textbf {u }}\ge O\left( \lg \frac{n}{\theta \delta \epsilon }\right) \). Unpacking the first requirement, using the definition \( N_{\mathsf {EIG}} := \lg \tfrac{256 n}{\epsilon } + 3\lg \lg \tfrac{256 n}{\epsilon } + \lg \lg \tfrac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^9} + 7.59\) from Theorem 5.5, and recalling that \(\epsilon \le 1/2\), \(n \ge 1\), and \((1 - x)^{1/x} \ge 1/4\) for \(x \in (0,1/512)\), we have

$$\begin{aligned} \frac{\left( 1 - \frac{\epsilon }{256n}\right) ^{2^{N_{\mathsf {EIG}} + 1}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}&= \frac{\left( \left( 1 - \frac{\epsilon }{256n}\right) ^{\frac{256n}{\epsilon }}\right) ^{\lg ^3\frac{256 n}{\epsilon } \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{8.59}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n}N_{\mathsf {EIG}}} \\&\ge \frac{4^{-\lg ^3\frac{256 n}{\epsilon } \lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{8.59}(c_{\mathsf {INV}}\log n + 3)}}{\mu _{\mathsf {INV}}(n)\sqrt{n} N_{\mathsf {EIG}}}, \end{aligned}$$

so setting \({\textbf {u }}\) smaller than the final expression is sufficient to guarantee \(\mathsf {EIG}\) and all subroutines can execute as advertised. This gives

$$\begin{aligned} \lg 1/{\textbf {u }}&\ge \lg ^3 \frac{n}{\epsilon }\lg \frac{(5n)^{26}}{\theta ^2\delta ^4\epsilon ^8}2^{9.59}(c_{\mathsf {INV}}\log n + 3) + \lg N_{\mathsf {EIG}} \\&= O\left( \log ^3\frac{n}{\epsilon }\log \frac{n}{\theta \delta \epsilon }\log n\right) . \end{aligned}$$

This dominates the precision requirement from (45), and completes the proof of Theorem 5.5.

Remark 5.16

A constant may be extracted directly from the expression above—leaving \(\epsilon ,\delta ,\theta \) fixed, a crude bound on it is \(2^{9.59} \cdot 26 \cdot 8 \cdot c_{\mathsf {INV}} \approx 160303 c_{\mathsf {INV}}\). This can certainly be optimized; the improvement with the highest impact would be a tighter analysis of \(\mathsf {SPLIT}\), with the aim of eliminating the additive 7.59 term in \(N_{\mathsf {SPLIT}}\).

6 Conclusion and Open Questions

In this paper, we reduced the approximate diagonalization problem to a polylogarithmic number of matrix multiplications, inversions, and QR factorizations on a floating point machine with precision depending only polylogarithmically on n and \(1/\delta \). The key phenomena enabling this were: (a) every matrix is \(\delta \)-close to a matrix with well-behaved pseudospectrum, and such a matrix can be found by a complex Gaussian perturbation and (b) the spectral bisection algorithm can be shown to converge rapidly to a forward approximate solution on such a well-behaved matrix, using a polylogarithmic in n and \(1/\delta \) amount of precision and number of iterations. The combination of these facts yields a \(\delta \)-backward approximate solution for the original problem.

Using fast matrix multiplication, we obtain algorithms with nearly optimal asymptotic computational complexity (as a function of n, compared to matrix multiplication), for general complex matrices with no assumptions. Using naïve matrix multiplication, we get easily implementable algorithms with \(O(n^3)\)-type complexity and much better constants, which are likely faster in practice. The constants in our bit complexity and precision estimates (see Theorem 5.5 and equations (41) and (42)), while not huge, are likely suboptimal. The reasonable practical performance of spectral bisection based algorithms is witnessed by the many empirical papers (see e.g. [5]) that have studied them. The more recent of these works further show that such algorithms are communication-avoiding and have good parallelizability properties.

Remark 6.1

(Hermitian Matrices) A curious feature of our algorithm is that even when the input matrix is Hermitian or real symmetric, it begins by adding a complex non-Hermitian perturbation to regularize the spectrum. If one is only interested in this special case, one can replace this first step by a Hermitian GUE or symmetric GOE perturbation and appeal to the result of [1] instead of Theorem 1.4, which also yields a polynomial lower bound on the minimum gap of the perturbed matrix. It is also possible to obtain a much stronger analysis of the Newton iteration in the Hermitian case, since the iterates are all Hermitian and \(\kappa _V=1\) for such matrices. By combining these observations, one can obtain a running time for Hermitian matrices which is significantly better (in logarithmic factors) than our main theorem. We do not pursue this further since our main goal was to address the more difficult non-Hermitian case.

We conclude by listing several directions for future research.

1. Devise a deterministic algorithm with similar guarantees. The main bottleneck to doing this is deterministically finding a regularizing perturbation, which seems quite mysterious. Another bottleneck is computing a rank-revealing QR factorization in near matrix multiplication time deterministically (all of the currently known deterministic algorithms require \(\Omega (n^3)\) time).

2. Determine the correct exponent for smoothed analysis of the eigenvalue gap of \(A+\gamma G\) where G is a complex Ginibre matrix. We currently obtain roughly \((\gamma /n)^{8/3}\) in Theorem 3.6. Is it possible to match the \(n^{-4/3}\) type dependence [64] which is known for a pure Ginibre matrix?

3. Reduce the dependence of the running time and precision to a smaller power of \(\log (1/\delta )\). The bottleneck in the current algorithm is the number of bits of precision required for stable convergence of the Newton iteration for computing the sign function. Other, “inverse-free” iterative schemes have been proposed for this, which conceivably require lower precision.

4. Study the convergence of “scaled Newton iteration” and other rational approximation methods (see [40, 48]) for computing the sign function on non-Hermitian matrices. Perhaps these have even faster convergence and better stability properties?

More broadly, we hope that the techniques introduced in this paper—pseudospectral shattering and pseudospectral analysis of matrix iterations using contour integrals—are useful in attacking other problems in numerical linear algebra.