On Randomized Trace Estimates for Indefinite Matrices with an Application to Determinants

Randomized trace estimation is a popular and well-studied technique that approximates the trace of a large-scale matrix $B$ by computing the average of $x^T B x$ for many samples of a random vector $X$. Often, $B$ is symmetric positive definite (SPD), but a number of applications give rise to indefinite $B$. Most notably, this is the case for log-determinant estimation, a task that features prominently in statistical learning, for instance in maximum likelihood estimation for Gaussian process regression. The analysis of randomized trace estimates, including tail bounds, has mostly focused on the SPD case. In this work, we derive new tail bounds for randomized trace estimates applied to indefinite $B$ with Rademacher or Gaussian random vectors. These bounds significantly improve existing results for indefinite $B$, reducing the number of required samples by a factor $n$ or even more, where $n$ is the size of $B$. Even for an SPD matrix, our work improves an existing result by Roosta-Khorasani and Ascher (Found Comput Math 15(5):1187–1212, 2015) for Rademacher vectors. This work also analyzes the combination of randomized trace estimates with the Lanczos method for approximating the trace of $f(B)$. Particular attention is paid to the matrix logarithm, which is needed for log-determinant estimation. We improve and extend an existing result, to cover not only Rademacher but also Gaussian random vectors.


Introduction
This paper is concerned with approximating the trace of a symmetric matrix $B \in \mathbb{R}^{n \times n}$ that is accessible only implicitly via matrix-vector products or, more precisely, (approximate) quadratic forms. If $X$ is a random vector of length $n$ such that $\mathbb{E}[X] = 0$ and $\mathbb{E}[XX^T] = I$, then $\mathbb{E}[X^T B X] = \operatorname{tr}(B)$. Based on this result, a stochastic trace estimator [23] is obtained from sampling an average of $N$ quadratic forms:
$$\operatorname{tr}_N(B) := \frac{1}{N} \sum_{i=1}^{N} \big( X^{(i)} \big)^T B\, X^{(i)}, \qquad (1)$$
where $X^{(i)}$, $i = 1, \dots, N$, are independent copies of $X$. The most common choices for $X$ are standard Gaussian and Rademacher random vectors. The latter are defined by having i.i.d. entries that take the values $\pm 1$ with equal probability. We will consider both choices in this paper and denote the resulting trace estimates by $\operatorname{tr}_N^G(B)$ and $\operatorname{tr}_N^R(B)$, respectively. Hutchinson [23] used $\operatorname{tr}_N^R(B)$ to approximate the trace of the influence matrix of Laplacian smoothing splines. In this setting, $B = A^{-1}$ for a symmetric positive definite (SPD) matrix $A$ and, in turn, $B$ is SPD as well. Other applications, such as spectral density estimation [27] and determinant computation [7], may feature a symmetric but indefinite matrix $B$. For approximating the determinant, one exploits the relation
$$\log(\det(A)) = \operatorname{tr}(\log(A)), \qquad (2)$$
where $\log(A)$ denotes the matrix logarithm of $A$. The need for estimating determinants arises, for instance, in statistical learning [4,16,18], lattice quantum chromodynamics [33], and Markov random field models [36]. Certain quantities associated with graphs can be formulated as determinants, such as the number of spanning trees and triangles, and various negative approximation results exist in this context; see, e.g., [5,14,15,29]. Relying on the Cholesky factorization, the exact computation of the determinant is often infeasible for a large matrix $A$. In contrast, the Hutchinson estimator combined with (2) bypasses the need for factorizing $A$ and instead requires one to (approximately) evaluate the quadratic form $x^T \log(A)\, x$ for several vectors $x \in \mathbb{R}^n$. Compared to the task of estimating the trace of $A^{-1}$, the determinant computation via (2) is complicated by two issues: (a) even when $A$ is SPD, the matrix $B = \log(A)$ may be indefinite; and (b) the quadratic forms $x^T \log(A)\, x$ themselves are expensive to compute exactly, so they need to be approximated. We mention in passing that there are other methods to approximate determinants, including subspace iteration [31] and block Krylov methods [26], but they only work well in specific cases, e.g., when $A = \sigma I + C$ for a matrix $C$ of low numerical rank. By the central limit theorem, the estimate (1) can be expected to become more reliable as $N$ increases; see, e.g., [13, Corollaries 3.3 and 4.3] for such an asymptotic result as $N \to \infty$. Most existing non-asymptotic results for trace estimation are specific to an SPD matrix $B$; see [6,20,30] for examples. They provide a bound on the estimated number $N$ of probe vectors needed to ensure a small relative error with high probability:
$$\mathbb{P}\big( \big| \operatorname{tr}_N(B) - \operatorname{tr}(B) \big| \le \varepsilon \operatorname{tr}(B) \big) \ge 1 - \delta; \qquad (3)$$
see Remark 7 below for a specific example. As already mentioned, the assumption that $B$ is SPD is usually not met when computing the determinant of an SPD matrix $A$ via $\operatorname{tr}(\log(A))$, because this would require all eigenvalues of $A$ to be larger than one. For a general indefinite $B$, it is unrealistic to aim at a bound of the form (3) for the relative error, because $\operatorname{tr}(B) = 0$ does not imply zero error. Ubaru, Chen, and Saad [34] instead derive a bound for the absolute error via rescaling, that is, the results from [30] are applied to the matrix $C := -\log(\lambda A)$ for a value of $\lambda > 0$ that ensures that $C$ is
SPD. Specifically, for Rademacher vectors it is shown in [34, Corollary 4.5] that the absolute error bound
$$\mathbb{P}\big( \big| \operatorname{tr}_N(B) - \operatorname{tr}(B) \big| \le \varepsilon \big) \ge 1 - \delta \qquad (4)$$
is satisfied with fixed failure probability $\delta$ if the number of samples $N$ grows at most quadratically with $n$. Unfortunately, this estimated number of samples compares unfavorably with a much simpler approach: computing the trace from the diagonal elements of $\log(A)$ only requires the evaluation of $n$ quadratic forms, using all $n$ unit vectors of length $n$.
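For concreteness, the basic estimator (1) takes only a few lines of code. The following Python sketch is ours for illustration (the function name and the use of NumPy are our choices, not part of any reference implementation); it accesses $B$ only through a user-supplied matrix-vector product, as assumed throughout this paper:

    import numpy as np

    def hutchinson_trace(matvec, n, N, dist="rademacher", seed=None):
        """Estimate tr(B) by averaging N quadratic forms x^T B x."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(N):
            if dist == "rademacher":
                x = rng.choice([-1.0, 1.0], size=n)  # i.i.d. entries +-1 with equal probability
            else:
                x = rng.standard_normal(n)           # standard Gaussian probe vector
            total += x @ matvec(x)
        return total / N

Both choices of probe vectors yield an unbiased estimate, since $\mathbb{E}[x^T B x] = \operatorname{tr}(B)$ whenever $\mathbb{E}[x] = 0$ and $\mathbb{E}[xx^T] = I$.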
To approximate the quadratic forms $x^T \log(A)\, x$, a polynomial approximation of the logarithm can be used; see [21,28] for approximation by Chebyshev expansion/interpolation and [8,12,38] for approximation by Taylor series expansion. Often, a better approximation can be obtained by the Lanczos method, which is equivalent to applying Gaussian quadrature to the integral $\int \log(x)\, d\mu(x)$ on the spectral interval of $A$, for a suitably defined measure $\mu$; see [19]. In this case, upper and lower bounds for the quantity $x^T \log(A)\, x$ can be determined without much additional effort [7]. Moreover, the convergence of Gaussian quadrature for the quadratic form can be related to the best polynomial approximation of the logarithm on the spectral interval of $A$; see [34, Theorem 4.2]. By combining the polynomial approximation error with (4), one obtains a total error bound that takes into account both sources of error. Such a result is presented in [34, Corollary 4.5] for Rademacher vectors; the fact that all such vectors have bounded norm is essential in the analysis.
In this paper, we improve the results from [34] by first showing that the number of samples required to achieve (4) is much lower. In particular, we show for a general symmetric matrix $B$ that
$$\mathbb{P}\big( \big| \operatorname{tr}_N(B) - \operatorname{tr}(B) \big| \ge \varepsilon \big) \le \delta \qquad (5)$$
is satisfied with fixed failure probability $\delta$ if the number of samples $N$ grows proportionally with the stable rank $\rho(B) := \|B\|_F^2 / \|B\|_2^2$; as $\rho(B) \in [1, n]$, the growth is at most linear in $n$ (instead of quadratic). We derive such a result for both Gaussian and Rademacher vectors, and demonstrate that the dependence on $n$ is asymptotically tight with an explicit example. For SPD matrices, our bound also improves the state-of-the-art result [30, Theorem 1] for Rademacher vectors by establishing that the number of probe vectors is inversely proportional to the stable rank.
Specialized to determinant computation, we combine our results with an improved analysis of the Lanczos method to obtain a sharper total error bound for Rademacher vectors. Finally, we extend this combined error bound to Gaussian vectors, which requires some additional care because of the unboundedness of such vectors. We remark that some of our results are potentially of wider interest, beyond stochastic trace and determinant estimation, such as a tail bound for Rademacher chaos (Theorem 8) and an error bound (Corollary 15 combined with Corollary 23) on the polynomial approximation of the logarithm.

Tail bounds for trace estimates
In this section we derive tail bounds of the form (5) for the stochastic trace estimator applied to a symmetric, possibly indefinite matrix $B \in \mathbb{R}^{n \times n}$. We will analyze Gaussian and Rademacher vectors separately. In the following, we will frequently use a spectral decomposition $B = Q \Lambda Q^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ contains the eigenvalues of $B$ and $Q$ is an orthogonal matrix.

Standard Gaussian random vectors
The case of Gaussian vectors will be addressed by using a tail bound for sub-Gamma random variables, which follows from Chernoff bounds; see, e.g., [11].
Lemma 2 ([11, Section 2.4]). Let $X$ be a sub-Gamma random variable with parameters $(\nu, c)$. Then, for all $\varepsilon \ge 0$, we have
$$\mathbb{P}\big( X \ge \sqrt{2\nu\varepsilon} + c\,\varepsilon \big) \le e^{-\varepsilon}.$$

Lemma 3 ([35, Proposition 2.10]). Let $X$ be a random variable such that $\mathbb{E}[X] = 0$, and such that both $X$ and $-X$ are sub-Gamma with parameters $(\nu, c)$. Then, for all $\varepsilon \ge 0$, we have
$$\mathbb{P}\big( |X| \ge \sqrt{2\nu\varepsilon} + c\,\varepsilon \big) \le 2 e^{-\varepsilon}.$$

Lemma 3 implies the following result for the tail of a single-sample trace estimate. This result is similar, but not identical, to [11, Example 2.12] and [25, Lemma 1], which apply to symmetric matrices with zero diagonal and SPD matrices, respectively.

Lemma 4. For a Gaussian vector $X$ of length $n$ and a symmetric matrix $B \in \mathbb{R}^{n \times n}$ we have
$$\mathbb{P}\big( \big| X^T B X - \operatorname{tr}(B) \big| \ge 2 \|B\|_F \sqrt{\varepsilon} + 2 \|B\|_2\, \varepsilon \big) \le 2 e^{-\varepsilon}$$
for all $\varepsilon > 0$.
Theorem 5. Let $B \in \mathbb{R}^{n \times n}$ be symmetric. Then
$$\mathbb{P}\Big( \big| \operatorname{tr}_N^G(B) - \operatorname{tr}(B) \big| \ge \frac{2}{\sqrt{N}} \|B\|_F \sqrt{\varepsilon} + \frac{2}{N} \|B\|_2\, \varepsilon \Big) \le 2 e^{-\varepsilon}$$
for all $\varepsilon > 0$. In particular, (5) holds for
$$N \ge \frac{4}{\varepsilon^2} \big( \|B\|_F^2 + \varepsilon \|B\|_2 \big) \log\frac{2}{\delta}. \qquad (6)$$

Proof. We apply Lemma 4 to the matrix
$$\mathcal{B} := \frac{1}{N} \operatorname{diag}(B, \dots, B), \qquad (7)$$
that is, the block diagonal matrix with the $N$ diagonal blocks containing rescaled copies of $B$. In turn, the trace estimate (1) equals $X^T \mathcal{B} X$ for a Gaussian vector $X$ of length $nN$. The random variable $X^T \mathcal{B} X - \operatorname{tr}(B)$, where $\mathcal{B}$ is defined as in (7), is sub-Gamma with parameters $\big( \frac{2}{N} \|B\|_F^2, \frac{2}{N} \|B\|_2 \big)$, because $\|\mathcal{B}\|_F = N^{-1/2} \|B\|_F$ and $\|\mathcal{B}\|_2 = N^{-1} \|B\|_2$, and the same holds for $-(X^T \mathcal{B} X - \operatorname{tr}(B))$. Lemma 3 then yields the stated tail bound. Imposing a failure probability of $2 e^{-\varepsilon} = \delta$ and requiring the resulting deviation to be at most $\varepsilon$ leads to condition (6).

We recall that the stable rank of $B$ is defined as $\rho(B) := \|B\|_F^2 / \|B\|_2^2$. In particular, $\rho(B) = 1$ when $B$ has rank one and $\rho(B) = n$ when all singular values are equal. Intuitively, $\rho(B)$ tends to be large when $B$ has many singular values not significantly smaller than the largest one. The minimum number of probe vectors required by Theorem 5 depends on the stable rank of $B$ in the following way:
$$N \ge \frac{4}{\varepsilon^2} \big( \rho(B)\, \|B\|_2^2 + \varepsilon \|B\|_2 \big) \log\frac{2}{\delta}.$$
Because $\rho(B) \le n$, this indicates that $N$ may need to be chosen proportionally with $n$ to reach a fixed (absolute) accuracy $\varepsilon$ with constant success probability, provided that $\|B\|_2$ remains constant as well. The following lemma shows for a simple matrix $B$ that such a linear growth of $N$ can actually not be avoided.
Lemma 6. Let $n$ be even and consider the traceless matrix
$$B = \begin{bmatrix} I_{n/2} & 0 \\ 0 & -I_{n/2} \end{bmatrix}.$$
Then, for every $\varepsilon > 0$, it holds that
$$\mathbb{P}\big( \big| \operatorname{tr}_N^G(B) - \operatorname{tr}(B) \big| \le \varepsilon \big) \le \varepsilon \sqrt{\frac{N}{\pi (n - 2/N)}}.$$

Proof. By the definition of $B$, the trace estimate takes the form
$$\operatorname{tr}_N^G(B) = \frac{1}{N} (X - Y),$$
where $X, Y \sim \chi^2_{nN/2}$ are independent chi-squared random variables with $nN/2$ degrees of freedom. The probability density function $f$ of $Z = X - Y$ can be expressed in terms of a modified Bessel function of the second kind [1]. In particular,
$$f(0) = \frac{\Gamma\big( \tfrac{nN}{4} - \tfrac{1}{2} \big)}{4 \sqrt{\pi}\, \Gamma\big( \tfrac{nN}{4} \big)} \le \frac{1}{2 \sqrt{\pi (nN - 2)}},$$
where we used the duplication formula for Gamma functions and the inequality $\Gamma(x - \tfrac12)/\Gamma(x) \le (x - \tfrac12)^{-1/2}$; see [2]. As $f$ is an autocorrelation function (of the density function of a chi-squared variable with $nN/2$ degrees of freedom), its maximum is at $0$. We can therefore estimate the probability of $X - Y$ being in the interval $[-N\varepsilon, N\varepsilon]$ in the following way:
$$\mathbb{P}\big( |X - Y| \le N\varepsilon \big) \le 2 N \varepsilon\, f(0) \le \varepsilon \sqrt{\frac{N}{\pi (n - 2/N)}}.$$

As the example in Lemma 6 shows, the potential growth of $\varepsilon$ with $\sqrt{n}$ cannot be avoided in general. Figure 1 illustrates this growth. Also the dependence of $N$ on $\log\frac{2}{\delta}$ and $\frac{1}{\varepsilon^2}$ in Theorem 5 cannot be improved, as shown in [37, Theorem 3].
Remark 7. For a nonzero symmetric positive semidefinite (SPSD) matrix $B$, Theorem 5 yields the following relative error estimate:
$$\mathbb{P}\big( \big| \operatorname{tr}_N^G(B) - \operatorname{tr}(B) \big| \ge \varepsilon \operatorname{tr}(B) \big) \le \delta \quad \text{for} \quad N \ge \frac{4 (1 + \varepsilon)}{\rho\, \varepsilon^2} \log\frac{2}{\delta},$$
where we used that $\operatorname{tr}(B) \ge \|B\|_F^2 / \|B\|_2$ for SPSD $B$. State-of-the-art results of a similar form are Theorem 3 in [30], which requires $N \ge \frac{8}{\varepsilon^2}\, \mu \log\frac{2}{\delta}$, with $\mu := \frac{\|B\|_2}{\operatorname{tr}(B)} \le 1$, and Corollary 3.3 in [20], which requires a comparable $N$ with a smaller constant but under a restriction on $\varepsilon$. Compared to [20], our result imposes no restriction on $\varepsilon$ at the expense of a somewhat larger constant. On the other hand, because $\mu$ can be much smaller than $\rho^{-1}$, the result from [30] is more favorable than ours for SPSD matrices.
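To illustrate how the sample-size condition (6) is used in practice, the following sketch (our code; it assumes that $B$ is available as a dense array, which the theory does not require) evaluates the stable rank and the resulting bound on $N$:

    import numpy as np

    def gaussian_sample_bound(B, eps, delta):
        # Condition (6): N >= 4/eps^2 * (||B||_F^2 + eps*||B||_2) * log(2/delta)
        fro2 = np.linalg.norm(B, "fro") ** 2
        spec = np.linalg.norm(B, 2)
        stable_rank = fro2 / spec ** 2  # rho(B), between 1 and n
        N = int(np.ceil(4.0 / eps ** 2 * (fro2 + eps * spec) * np.log(2.0 / delta)))
        return N, stable_rank

For fixed $\varepsilon$, $\delta$, and $\|B\|_2$, the returned $N$ grows linearly with the stable rank, in line with the discussion after Theorem 5.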

Rademacher random vectors
The quadratic form $X^T B X$ for a Rademacher vector $X$ is called Rademacher chaos of order 2. We will first consider the homogeneous case, corresponding to a matrix $B$ with zero diagonal, which has been studied extensively in the literature; see, e.g., [11,17,22,24,32]. The non-homogeneous case is easily obtained from the homogeneous case; see Corollary 10 below. We make use of the entropy method [11] to establish the following tail bound for a single-sample trace estimate.
Theorem 8. Let $X$ be a Rademacher vector of length $n$ and let $B$ be a nonzero symmetric matrix such that $B_{ii} = 0$ for $i = 1, \dots, n$. Then, for all $\varepsilon > 0$,
$$\mathbb{P}\big( \big| X^T B X \big| \ge 2\sqrt{2\varepsilon}\, \|B\|_F + 4\varepsilon\, \|B\|_2 \big) \le 2 e^{-\varepsilon}. \qquad (10)$$

Proof. The proof follows closely [3, Theorem 6] and [11, Theorem 17]; see Remark 9 for a comparison with these results. The main idea of the proof is as follows. Using the logarithmic Sobolev inequalities discussed in the appendix, a bound on the entropy of the random variable $X^T B X$ is obtained. Using a (modified) Herbst argument, we derive a bound on the moment generating function (MGF) of $X^T B X$, establishing that it is sub-Gamma with certain constants, which then allows us to apply Lemma 3.

Without loss of generality, we may assume $\|B\|_2 = 1$; the general case follows from applying the result to $B / \|B\|_2$. Let us consider the function $f(x) := x^T B x$. We want to apply the logarithmic Sobolev inequality (22) from Theorem 21 to $f(X)$. For this purpose, we let $x^{(i)} := x - 2 x_i e_i$, where $e_i$ denotes the $i$th unit vector; that is, $x^{(i)}$ is obtained from $x$ by flipping the sign of its $i$th entry. Using that $B$ has zero diagonal entries, we obtain
$$f(x) - f\big( x^{(i)} \big) = 4 x_i\, e_i^T B x \quad \text{and hence} \quad \sum_{i=1}^{n} \big( f(x) - f\big( x^{(i)} \big) \big)^2 = 16\, \|B x\|_2^2.$$
Therefore, denoting $Y := \|B X\|_2^2$, Theorem 21 establishes, for all $\lambda > 0$, a bound on the entropy of $e^{\lambda f(X)}$ involving $Y$. The decoupling inequality in [17, Lemma 8.50], which follows from Jensen's inequality, separates the contributions of $Y$ and $f(X)$; combined with the entropy bound, it controls the entropy of $e^{\lambda f(X)}$ in terms of the MGFs of $Y$ and $f(X)$.

To find an upper bound on the MGF of $Y$, we use again a logarithmic Sobolev inequality and then transform the obtained bound on the entropy into a bound on the MGF by the Herbst argument. We do so by applying the inequality (21) from Theorem 21 to the function $h : \mathbb{R}^n \to \mathbb{R}$ defined by $h(x) := \|B x\|_2^2$. For this purpose, note that
$$h(x) - h\big( x^{(i)} \big) = 4 x_i\, e_i^T B^T B x - 4\, \|B e_i\|_2^2 \quad \text{for } x \in \{-1, 1\}^n,$$
and, hence, the relevant sum of squared differences is bounded by a multiple of $h(x)$, using $\|B\|_2 = 1$. Therefore Theorem 21 gives a bound on the entropy of $e^{\lambda Y}$, and the Herbst argument yields a bound on the MGF of $Y$. Inserting this bound into the entropy estimate for $e^{\lambda f(X)}$ shows that the random variable $f(X)$ satisfies (20) for the function $g(\lambda) := \frac{2 \|B\|_F^2}{1 - 4\lambda}$ on $(0, \tfrac14)$. Lemma 20 then gives
$$\log \mathbb{E}\big[ e^{\lambda f(X)} \big] \le \lambda \int_0^{\lambda} \frac{2 \|B\|_F^2}{1 - 4t}\, dt = \frac{\lambda \|B\|_F^2}{2} \log\frac{1}{1 - 4\lambda} \le \frac{2 \|B\|_F^2\, \lambda^2}{1 - 4\lambda},$$
where we used $\log(1 + x) \le x$ in the last inequality. Replacing $f$ by $-f$ and $B$ by $-B$, we also obtain the same bound for the MGF of $-f(X)$. Therefore the random variables $f(X)$ and $-f(X)$ are sub-Gamma with parameters $(4 \|B\|_F^2, 4)$. Applying Lemma 3 concludes the proof.

Remark 9. The proof of Theorem 8 follows the proof of [3, Theorem 6], which in turn refines a result from [10, Theorem 17] (see also [11]) by substituting the more general logarithmic Sobolev inequality from [10, Proposition 10] with the ones from Theorem 21, which are specific to Rademacher random variables. However, let us stress that the results in [3,10] feature larger constants, partly because they deal with the more general Rademacher chaos
$$\sup_{B \in \mathcal{B}} X^T B X, \qquad (13)$$
where $\mathcal{B}$ is a set of symmetric matrices with zero diagonal. Restricted to the case $\mathcal{B} = \{B\}$, the results stated in [3, Theorem 6] and [11, Exercise 6.9] give tail bounds with significantly larger constants, for instance featuring a term $128 \|B\|_2\, \varepsilon$ in place of $4 \|B\|_2\, \varepsilon$. Proposition 8.13 in [17], which also aims at the more general (13), states a similar bound with larger constants.

As for Gaussian vectors, the result of Theorem 8 can be turned into a tail bound for $\operatorname{tr}_N^R(B)$ by block diagonal embedding. In the following, let $D_B$ denote the diagonal matrix containing the diagonal entries of $B$.
Corollary 10. Let $B$ be a nonzero symmetric matrix. Then, for all $\varepsilon > 0$,
$$\mathbb{P}\Big( \big| \operatorname{tr}_N^R(B) - \operatorname{tr}(B) \big| \ge \frac{2\sqrt{2\varepsilon}}{\sqrt{N}} \|B - D_B\|_F + \frac{4\varepsilon}{N} \|B - D_B\|_2 \Big) \le 2 e^{-\varepsilon}.$$
In particular, (5) holds for
$$N \ge \frac{8}{\varepsilon^2} \big( \|B - D_B\|_F^2 + \varepsilon \|B - D_B\|_2 \big) \log\frac{2}{\delta}.$$

Proof. Because Rademacher vectors satisfy $X^T D_B X = \operatorname{tr}(D_B) = \operatorname{tr}(B)$ deterministically, we have $\operatorname{tr}_N^R(B) - \operatorname{tr}(B) = X^T C X$ for a Rademacher vector $X$ of length $Nn$ and the block diagonal matrix $C := \frac{1}{N} \operatorname{diag}(B - D_B, \dots, B - D_B)$. The matrix $C$ has zero diagonal, $\|C\|_F = N^{-1/2} \|B - D_B\|_F$, and $\|C\|_2 = N^{-1} \|B - D_B\|_2$. Now, the first part of the corollary directly follows from Theorem 8. Imposing a failure probability of $\delta$ in (10) gives $\varepsilon$-accuracy provided that
$$\frac{2\sqrt{2}}{\sqrt{N}} \|B - D_B\|_F \sqrt{\log\tfrac{2}{\delta}} + \frac{4}{N} \|B - D_B\|_2 \log\tfrac{2}{\delta} \le \varepsilon,$$
and hence the second part follows.

Remark 11. It is instructive to compare the result of Corollary 10 with the straightforward application of Bernstein's inequality, which gives a tail bound whose exponent features a denominator containing a term proportional to $n \|B\|_2\, \varepsilon$, because the individual samples $X^T B X - \operatorname{tr}(B)$ can only be bounded by a multiple of $n \|B\|_2$. Clearly, a disadvantage of this bound is the explicit dependence of the denominator on $n$, which does not appear in Corollary 10.
An alternative expression for the lower bound on $N$ is obtained by noting that $\|B - D_B\|_F \le \|B\|_F$ and $\|B - D_B\|_2 \le 2 \|B\|_2$ (the factor 2 in the latter inequality is asymptotically tight; see, e.g., [9]). The result of Corollary 10 thus states that it suffices to choose
$$N \ge \frac{8}{\varepsilon^2} \big( \rho\, \|B\|_2^2 + 2 \varepsilon \|B\|_2 \big) \log\frac{2}{\delta},$$
where $\rho$ is the stable rank of $B$. In analogy to the Gaussian case, the following lemma shows that a potential linear dependence of $N$ on $n$ cannot be avoided in general.
Lemma 12. Let $n$ be even and consider the traceless matrix $B = \operatorname{diag}(E, \dots, E)$ consisting of $n/2$ diagonal blocks $E = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$. Then
$$\mathbb{P}\big( \big| \operatorname{tr}_N^R(B) - \operatorname{tr}(B) \big| \le \varepsilon \big) \le \Big( \frac{\varepsilon N}{2} + 1 \Big) \frac{2}{\sqrt{\pi n N}}$$
for every $\varepsilon > 0$.
Proof. We first note that $\operatorname{tr}_N^R(B) = \frac{2}{N} \sum_{i=1}^{nN/2} Z_i$ for i.i.d. Rademacher variables $Z_i$, because each sample contributes $X^T B X = 2 \sum_{j=1}^{n/2} X_{2j-1} X_{2j}$, and the products $X_{2j-1} X_{2j}$ are again i.i.d. Rademacher variables. Hence, $\mathbb{P}\big( |\operatorname{tr}_N^R(B)| \le \varepsilon \big)$ equals the probability that the number of variables satisfying $Z_i = 1$ is at least $\frac{n - \varepsilon}{4} N$ and at most $\frac{n + \varepsilon}{4} N$. Therefore,
$$\mathbb{P}\big( \big| \operatorname{tr}_N^R(B) \big| \le \varepsilon \big) \le \Big( \frac{\varepsilon N}{2} + 1 \Big) \max_k \binom{nN/2}{k} 2^{-nN/2} \le \Big( \frac{\varepsilon N}{2} + 1 \Big) \frac{2}{\sqrt{\pi n N}},$$
where we used the inequality $\binom{2m}{m} \le \frac{4^m}{\sqrt{\pi m}}$. We do not report a figure analogous to Figure 1 because the observed errors are very similar to the Gaussian case.
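The lower-bound example of Lemma 12 is straightforward to reproduce numerically; the following sketch (our construction, exploiting that $x^T B x$ reduces to a sum of $n/2$ independent Rademacher products per sample) shows that the observed error scales like $\sqrt{n/N}$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, N, trials = 256, 10, 2000
    errors = []
    for _ in range(trials):
        X = rng.choice([-1.0, 1.0], size=(N, n))
        quad = 2.0 * (X[:, 0::2] * X[:, 1::2]).sum(axis=1)  # x^T B x per sample
        errors.append(abs(quad.mean()))                     # |tr_N^R(B) - tr(B)|, tr(B) = 0
    print(np.median(errors))                                # grows like sqrt(n/N)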
For SPSD matrices, a relative error estimate follows from Corollary 10, similarly to what has been discussed in Remark 7 for Gaussian vectors. We recall that $\rho$ denotes the stable rank of $B$.
Corollary 13. For a nonzero SPSD matrix $B$, we have
$$\mathbb{P}\big( \big| \operatorname{tr}_N^R(B) - \operatorname{tr}(B) \big| \ge \varepsilon \operatorname{tr}(B) \big) \le \delta \quad \text{for} \quad N \ge \frac{8 (1 + \varepsilon)}{\rho\, \varepsilon^2} \log\frac{2}{\delta}.$$

Proof. First of all, it is immediate that $\|B - D_B\|_F \le \|B\|_F$. As shown, e.g., in [9, Theorem 4.1], the same holds for the spectral norm when $B$ is SPSD. For convenience, we provide a short proof: for every $y \in \mathbb{R}^n$ of norm 1 it holds that
$$-y^T D_B y \le y^T B y - y^T D_B y \le y^T B y,$$
where the first inequality uses that both $y^T B y$ and $y^T D_B y$ are nonnegative; hence $|y^T (B - D_B)\, y| \le \max\{ y^T B y,\, y^T D_B y \} \le \|B\|_2$. By taking the maximum with respect to all vectors of norm 1 one obtains $\|B - D_B\|_2$ on the left-hand side, which shows that it is bounded by $\|B\|_2$. Now, Corollary 10 implies that the relative error bound holds for
$$N \ge \frac{8}{\varepsilon^2} \Big( \frac{\|B\|_F^2}{\operatorname{tr}(B)^2} + \varepsilon\, \frac{\|B\|_2}{\operatorname{tr}(B)} \Big) \log\frac{2}{\delta}.$$
The proof is concluded by noting that $\operatorname{tr}(B) \ge \|B\|_F^2 / \|B\|_2$, which implies $\|B\|_F^2 / \operatorname{tr}(B)^2 \le \rho^{-1}$ and $\|B\|_2 / \operatorname{tr}(B) \le \rho^{-1}$.

Corollary 13 improves the result from [30, Theorem 1], which requires $N \ge \frac{6}{\varepsilon^2} \log\frac{2}{\delta}$, a lower bound that does not improve as the stable rank of $B$ increases.

Lanczos method to approximate quadratic forms
Let us now consider the problem of estimating the log-determinant through $\log(\det(A)) = \operatorname{tr}(\log(A))$ or, more generally, the problem of computing the trace of $f(A)$ for an analytic function $f$. Applying a trace estimator to $\operatorname{tr}(f(A))$ requires the (approximate) computation of the quadratic forms $x^T f(A)\, x$ for fixed vectors $x \in \mathbb{R}^n$. Following [34], we use the Lanczos method, Algorithm 1, for this purpose.
Algorithm 1 Lanczos method to approximate the quadratic form $x^T f(A)\, x$
Input: Matrix $A \in \mathbb{R}^{n \times n}$, nonzero vector $x \in \mathbb{R}^n$, number of iterations $m$
Output: Approximation of $x^T f(A)\, x$
1: Initialize $u_1 \leftarrow x / \|x\|_2$, $u_0 \leftarrow 0$, and $\beta_0 \leftarrow 0$
2: for $i = 1, \dots, m$ do
3: $\quad w \leftarrow A u_i - \beta_{i-1} u_{i-1}$
4: $\quad \alpha_i \leftarrow u_i^T w$
5: $\quad w \leftarrow w - \alpha_i u_i$
6: $\quad \beta_i \leftarrow \|w\|_2$; $u_{i+1} \leftarrow w / \beta_i$
7: end for
8: Form the tridiagonal matrix $T_m$ with diagonal entries $\alpha_1, \dots, \alpha_m$ and off-diagonal entries $\beta_1, \dots, \beta_{m-1}$
9: return $\|x\|_2^2\, e_1^T f(T_m)\, e_1$

For theoretical considerations, it is helpful to view the quadratic form as an integral:
$$I := x^T f(A)\, x = \int_{\lambda_{\min}}^{\lambda_{\max}} f(\lambda)\, d\mu(\lambda), \qquad (14)$$
where $\lambda_{\min}, \lambda_{\max}$ are the smallest/largest eigenvalues of $A$ and $\mu$ is the piecewise constant measure defined via a spectral decomposition $A = Q \Lambda Q^T$, $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, placing the weight $(q_i^T x)^2$ at the eigenvalue $\lambda_i$, with $q_i$ the $i$th column of $Q$. It is well known [19, Theorem 6.2] that the approximation $I_m$ returned by the $m$-point Gaussian quadrature rule applied to $I$ is identical to the approximation $\|x\|_2^2\, e_1^T f(T_m)\, e_1$ returned by $m$ steps of the Lanczos method. To bound the error $|I - I_m|$, the analysis in [34] proceeds by using existing results on the polynomial approximation error of analytic functions. Although our analysis is along the same lines, it differs in a key technical aspect: we derive and use an improved error bound for the approximation of the logarithm; see Corollary 16. We have also noted two minor errata in [34]; see the proof of Theorem 14 and the remark after Corollary 15 for details.
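A compact Python sketch of Algorithm 1 is given below. It is ours for illustration; full reorthogonalization is added for numerical robustness, which Algorithm 1 as stated does not include:

    import numpy as np

    def lanczos_quadrature(A, x, m, f=np.log):
        # m Lanczos steps on A with start vector x/||x||_2, followed by the
        # Gauss quadrature evaluation ||x||_2^2 * e_1^T f(T_m) e_1.
        n = len(x)
        U = np.zeros((n, m))
        alpha = np.zeros(m)
        beta = np.zeros(m)
        U[:, 0] = x / np.linalg.norm(x)
        for i in range(m):
            w = A @ U[:, i]
            if i > 0:
                w -= beta[i - 1] * U[:, i - 1]
            alpha[i] = U[:, i] @ w
            w -= alpha[i] * U[:, i]
            w -= U[:, : i + 1] @ (U[:, : i + 1].T @ w)  # reorthogonalize
            if i + 1 < m:
                beta[i] = np.linalg.norm(w)
                U[:, i + 1] = w / beta[i]
        T = np.diag(alpha) + np.diag(beta[: m - 1], 1) + np.diag(beta[: m - 1], -1)
        theta, S = np.linalg.eigh(T)                    # Ritz values and vectors
        return np.linalg.norm(x) ** 2 * (S[0, :] ** 2 * f(theta)).sum()

Note that $e_1^T f(T_m)\, e_1$ is evaluated through the spectral decomposition of the small tridiagonal matrix $T_m$, which is cheap since $m$ is typically small.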
Theorem 14. Let $f$ be analytic inside the Bernstein ellipse $E_\rho$ with foci $\pm 1$ and parameter $\rho > 1$, and let $M_\rho$ denote the maximum of $|f|$ on $E_\rho$. If the eigenvalues of $A$ are contained in $[-1, 1]$, then the Gauss quadrature approximation $I_m$ of the integral $I$ in (14) satisfies
$$|I - I_m| \le \frac{4 M_\rho\, \|x\|_2^2}{1 - \rho^{-1}}\, \rho^{-2m}.$$

Proof. As in [34], this result follows directly from bounds on the polynomial approximation error of analytic functions via Chebyshev expansion, combined with the fact that $m$-point Gaussian quadrature is exact for polynomials up to degree $2m - 1$. However, the proof of [34, Theorem 4.2] uses an extra ingredient, which seems to be wrong: it claims that the integration error for odd-degree Chebyshev polynomials is zero thanks to symmetry. While this fact is indeed true for the standard Lebesgue measure, it does not hold for the measure (14). In turn, one obtains the slightly worse factor $1 - \rho^{-1}$ in the denominator, compared to the factor $1 - \rho^{-2}$ that would have been obtained from [34, Theorem 4.2] translated into our setting.

Theorem 14 assumes that the eigenvalues of $A$ are contained in $[-1, 1]$. For a general spectral interval $[\lambda_{\min}, \lambda_{\max}]$, consider the affine linear transformation
$$\varphi(z) := \frac{2 z - \lambda_{\max} - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}},$$
which maps $[\lambda_{\min}, \lambda_{\max}]$ to $[-1, 1]$, and set $g := f \circ \varphi^{-1}$, so that
$$g(\varphi(A)) = f(A), \qquad x^T f(A)\, x = x^T g(\varphi(A))\, x. \qquad (15)$$
By its shift invariance, the Lanczos method with $g$, $\varphi(A)$, and $x$ returns the approximation $e_1^T g(\varphi(T_m))\, e_1$. This allows us to apply Theorem 14. Combined with the relations (15), the following result is obtained.

Corollary 15. With the notation introduced above, it holds that
$$\big| x^T f(A)\, x - \|x\|_2^2\, e_1^T f(T_m)\, e_1 \big| \le \frac{4 M_\rho\, \|x\|_2^2}{1 - \rho^{-1}}\, \rho^{-2m}.$$
Note that $M_\rho$ is the maximum of $g$ on $E_\rho$, which is equal to the maximum of $f$ on the transformed ellipse with foci $\lambda_{\min}$, $\lambda_{\max}$ and elliptical radius $(\lambda_{\max} - \lambda_{\min}) \rho / 2$. The result of Corollary 15 differs from the corresponding result in [34, page 1087], which features an additional, erroneous factor $(\lambda_{\max}(A) - \lambda_{\min}(A)) / 2$.
For the special case of the logarithm, the following result is obtained.
Corollary 16. Let $A \in \mathbb{R}^{n \times n}$ be SPD with condition number $\kappa(A)$, let $f \equiv \log$, and let $x \in \mathbb{R}^n \setminus \{0\}$. Then the error of the Lanczos method after $m$ steps satisfies
$$\big| x^T \log(A)\, x - \|x\|_2^2\, e_1^T \log(T_m)\, e_1 \big| \le \frac{4 M_\rho\, \|x\|_2^2}{1 - \rho^{-1}}\, \rho^{-2m}, \qquad \rho := \frac{\sqrt{2\kappa(A) + 1} + 1}{\sqrt{2\kappa(A) + 1} - 1},$$
where $M_\rho$ denotes the maximum of $|\log|$ on the ellipse with foci $1/2$ and $\kappa(A)/2$ corresponding to $E_\rho$.

Proof. The proof consists of applying Corollary 15 to a rescaled matrix. More specifically, we choose $B := \lambda A$ with $\lambda := 1/(2\lambda_{\min}) > 0$. The tridiagonal matrix returned by the Lanczos method with $A$ replaced by $B$ satisfies $T_m^B = \lambda T_m$. Together with the identity $\log(\lambda A) = \log(\lambda) I + \log(A)$, this implies that the Lanczos errors for $\log(B)$ and $\log(A)$ coincide. Note that the smallest/largest eigenvalues of $B$ are given by $1/2$ and $\kappa(A)/2$, respectively.

Combined bounds for determinant estimation
Combining randomized trace estimation with the Lanczos method, we obtain the following (stochastic) estimate for $\log(\det(A))$:
$$\log(\det(A)) = \operatorname{tr}(\log(A)) \approx \frac{1}{N} \sum_{i=1}^{N} \big\| X^{(i)} \big\|_2^2\, e_1^T \log\big( T_m^{(i)} \big)\, e_1, \qquad (16)$$
where $T_m^{(i)}$ denotes the tridiagonal matrix returned by $m$ steps of the Lanczos method applied to $A$ with starting vector $X^{(i)}$.

Theorem 18. Let $A \in \mathbb{R}^{n \times n}$ be SPD with condition number $\kappa = \kappa(A)$, and let $\varepsilon, \delta \in (0, 1)$. Consider the estimate (16) with Rademacher vectors $X^{(i)}$ and suppose that
(i) $N \ge \frac{16}{\varepsilon^2} \big( 2 \|\log(A) - D_{\log(A)}\|_F^2 + \varepsilon \|\log(A) - D_{\log(A)}\|_2 \big) \log\frac{2}{\delta}$;
(ii) $m \ge \frac{\sqrt{2\kappa + 1}}{4} \log\Big( \frac{8 n M_\rho}{(1 - \rho^{-1})\, \varepsilon} \Big)$, with $\rho$ and $M_\rho$ as in Corollary 16.
Then
$$\Big| \log(\det(A)) - \frac{1}{N} \sum_{i=1}^{N} \big\| X^{(i)} \big\|_2^2\, e_1^T \log\big( T_m^{(i)} \big)\, e_1 \Big| \le \varepsilon \qquad (17)$$
holds with probability at least $1 - \delta$.

Proof. For Rademacher vectors we have $\|X^{(i)}\|_2^2 = n$, and thus Corollary 16 implies that each quadratic form $\big( X^{(i)} \big)^T \log(A)\, X^{(i)}$ is approximated by $n\, e_1^T \log\big( T_m^{(i)} \big)\, e_1$ with error at most $\frac{4 n M_\rho}{1 - \rho^{-1}}\, \rho^{-2m}$. This error is bounded by $\varepsilon/2$ provided that $2m \log \rho \ge \log\big( \frac{8 n M_\rho}{(1 - \rho^{-1})\, \varepsilon} \big)$. Because of $\log \rho = \log\frac{\sqrt{2\kappa + 1} + 1}{\sqrt{2\kappa + 1} - 1} \ge \frac{2}{\sqrt{2\kappa + 1}}$, condition (ii) ensures that this inequality is satisfied.
Applying Corollary 10 to $\log(A)$, with $\varepsilon$ replaced by $\varepsilon/2$, immediately shows that
$$\Big| \frac{1}{N} \sum_{i=1}^{N} \big( X^{(i)} \big)^T \log(A)\, X^{(i)} - \operatorname{tr}(\log(A)) \Big| \le \frac{\varepsilon}{2}$$
holds with probability at least $1 - \delta$ if condition (i) is satisfied. The proof is concluded by applying the triangle inequality.
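In code, the estimate (16) simply composes the two previous sketches (again illustrative only; Rademacher probes combined with the lanczos_quadrature function defined earlier):

    import numpy as np

    def logdet_estimate(A, N, m, seed=None):
        # Estimate log(det(A)) = tr(log(A)) for SPD A, as in (16):
        # Rademacher probes, m Lanczos steps per quadratic form.
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        acc = 0.0
        for _ in range(N):
            x = rng.choice([-1.0, 1.0], size=n)
            acc += lanczos_quadrature(A, x, m, f=np.log)
        return acc / N

For small test matrices, the result can be compared against the exact value np.linalg.slogdet(A)[1].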
Comparison with existing result. To compare Theorem 18 with an existing result from [34], it is helpful to first derive a simpler (but usually stronger) condition on $N$.
Lemma 19. Let $A$ be SPD with condition number $\kappa(A) \ge 2$ and let $B := \lambda A$ with $\lambda := 1/(2\lambda_{\min})$, as in the proof of Corollary 16. Then $\|\log(B) - D_{\log(B)}\|_F \le \sqrt{n}\, \log \kappa(A)$ and $\|\log(B) - D_{\log(B)}\|_2 \le 2 \log \kappa(A)$.

An application of Corollary 10 to $\log(B)$ therefore yields (17) with probability at least $1 - \delta$ for
$$N \ge \frac{8}{\varepsilon^2} \big( n \log^2 \kappa(A) + 2 \varepsilon \log \kappa(A) \big) \log\frac{2}{\delta}.$$
Correcting for the two minor errata explained above, the result from [34, Corollary 4.5] requires, in our notation, that $N$ grows proportionally with $\varepsilon^{-2}\, n^2 \log^2 \kappa(A) \log\frac{2}{\delta}$ (their condition (18)) and that $m$ satisfies a condition (19) analogous to (ii). Lemma 19 improves (18) roughly by a factor $n$. However, let us stress that often the lower bound from Theorem 18 can be expected to be even better; below we describe a situation in which the bound only depends logarithmically on $n$. Condition (ii) of Theorem 18 clearly improves (19), but less drastically, roughly by a factor $\sqrt{3}$.
A situation leading to low stable ranks. We consider a family of matrices $\{A_n\}$ of increasing dimension, such that the condition number and the spectral norm remain bounded. For a fixed failure probability $\delta$ and a fixed accuracy $\varepsilon$, the number of matrix-vector multiplications required to attain (17) is then proportional to $N \cdot m = O(\rho_{\log} \cdot \log n)$, where $\rho_{\log}$ is the stable rank of $\log(A_n)$. In certain applications, including regularized kernel matrices (see, e.g., [18]), the stable rank also remains constant when the matrix size increases. For example, this is the case for $A_n := I + B_n$ where $B_n$ has exponentially decaying eigenvalues, that is, $\lambda_i(B_n) \le C \alpha^i$ for some constant $C > 0$ and $0 < \alpha < 1$, for all $i < n$. For all $n$ we have
$$\| \log(A_n) \|_F^2 \le \sum_{i=1}^{\infty} \log^2\big( 1 + C \alpha^i \big) < \infty,$$
so that $\rho_{\log}$ remains bounded. Therefore, the number of matrix-vector multiplications required to attain accuracy $\varepsilon$ with probability $1 - \delta$ is $O(\log n)$.
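This behavior is easy to verify numerically; in the following sketch, the constants $C = 1$ and $\alpha = 1/2$ are our arbitrary choices:

    import numpy as np

    n = 2000
    eigs = 1.0 + 0.5 ** np.arange(1, n + 1)        # spectrum of A_n = I + B_n
    logs = np.log(eigs)                            # eigenvalues of log(A_n)
    rho_log = (logs ** 2).sum() / logs.max() ** 2  # stable rank of log(A_n)
    print(rho_log)                                 # stays O(1) as n grows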
A Auxiliary results

A.1 Herbst argument and logarithmic Sobolev inequalities
This section contains auxiliary results used in the proof of Theorem 8. We recall that the entropy of a nonnegative random variable $Z$ is defined as
$$\operatorname{Ent}(Z) := \mathbb{E}[Z \log Z] - \mathbb{E}[Z] \log \mathbb{E}[Z],$$
provided that all the involved expected values exist. The Herbst argument (see, e.g., [11, page 11], [17, pages 239-240], and [35, Section 3.1.2]) turns a bound on the entropy of a random variable into a bound on the moment generating function. By Chernoff's bound, the latter implies a bound on the tail of the random variable. Specifically, we use the following (modified) Herbst argument.

Lemma 20. Let $Z$ be a random variable and let $g : (0, b) \to \mathbb{R}$ be such that
$$\operatorname{Ent}\big( e^{\lambda Z} \big) \le \lambda^2\, g(\lambda)\, \mathbb{E}\big[ e^{\lambda Z} \big] \quad \text{for all } \lambda \in (0, b). \qquad (20)$$
Then, for all $\lambda \in (0, b)$,
$$\log \mathbb{E}\big[ e^{\lambda Z} \big] \le \lambda\, \mathbb{E}[Z] + \lambda \int_0^{\lambda} g(t)\, dt.$$

Proof. Setting $\psi(\lambda) := \lambda^{-1} \log \mathbb{E}\big[ e^{\lambda Z} \big]$, a direct computation (spelled out below) shows that $\psi'(\lambda) \le g(\lambda)$ under (20), and the claim follows by integration. We conclude by noting that $\lim_{\lambda \to 0^+} \psi(\lambda) = \mathbb{E}[Z]$.
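The key computation behind Lemma 20 is the following identity (a standard calculation, reproduced here for the reader's convenience): with $\psi(\lambda) = \lambda^{-1} \log \mathbb{E}[e^{\lambda Z}]$,
$$\psi'(\lambda) = \frac{\lambda\, \mathbb{E}\big[ Z e^{\lambda Z} \big] - \mathbb{E}\big[ e^{\lambda Z} \big] \log \mathbb{E}\big[ e^{\lambda Z} \big]}{\lambda^2\, \mathbb{E}\big[ e^{\lambda Z} \big]} = \frac{\operatorname{Ent}\big( e^{\lambda Z} \big)}{\lambda^2\, \mathbb{E}\big[ e^{\lambda Z} \big]},$$
so that a pointwise bound on the entropy of $e^{\lambda Z}$ translates, after integration, into a bound on $\psi$ and hence on the moment generating function.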
For deriving bounds on the entropy, we need the following two variations of Gross' logarithmic Sobolev inequality.
The inequality (22) is a variation of the same argument; see also [11, Exercise 5.5] for a related (but not identical) result. The inequality (22) can, in fact, be found in a Master's thesis [3, Theorem 5]. For the convenience of the reader, we provide a proof of (22) based on the textbook [11].