1 Introduction

This paper is concerned with approximating the trace of a symmetric matrix \(B \in {\mathbb {R}}^{n \times n}\) that is accessible only implicitly via matrix-vector products or, more precisely, (approximate) quadratic forms. If X is a random vector of length n such that \({\mathbb {E}}[X] = 0\) and \({\mathbb {E}}[XX^T] = I\), then \({\mathbb {E}}[X^T B X] = \,\mathrm {tr}(B)\). Based on this result, a stochastic trace estimator [27] is obtained from sampling an average of N quadratic forms:

$$\begin{aligned} \,\mathrm {tr}_N(B) := \frac{1}{N} \sum _{i=1}^N (X^{(i)})^T B X^{(i)}, \end{aligned}$$
(1)

where \(X^{(i)}\), \(i = 1,\ldots , N\), are independent copies of X. The most common choices for X are standard Gaussian and Rademacher random vectors. The latter are defined by having i.i.d. entries that take values \(\pm 1\) with equal probability. We will consider both choices in this paper and denote the resulting trace estimates by \(\,\mathrm {tr}^{\mathsf {G}}_N(B)\) and \(\,\mathrm {tr}^{\mathsf {R}}_N(B)\), respectively.
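For illustration, the estimator (1) can be realized in a few lines; the following Python/NumPy sketch (with an interface of our choosing, not taken from any of the cited works) accesses B only through matrix-vector products and supports both probe vector choices.

```python
import numpy as np

def hutchinson(matvec, n, N, dist="rademacher", rng=None):
    """Stochastic trace estimate tr_N(B) as in (1).

    matvec : callable returning B @ x; B is accessed only through these products.
    n      : dimension of B.
    N      : number of probe vectors.
    dist   : "gaussian" or "rademacher" probe vectors.
    """
    rng = np.random.default_rng(rng)
    estimate = 0.0
    for _ in range(N):
        if dist == "gaussian":
            x = rng.standard_normal(n)
        else:  # Rademacher: i.i.d. entries +1/-1 with equal probability
            x = rng.integers(0, 2, size=n) * 2.0 - 1.0
        estimate += x @ matvec(x)
    return estimate / N

# Small sanity check on an explicitly available symmetric matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C = rng.standard_normal((500, 500))
    B = (C + C.T) / 2
    print(np.trace(B), hutchinson(lambda v: B @ v, 500, N=200))
```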

Hutchinson [27] used \(\,\mathrm {tr}^{\mathsf {R}}_N(B)\) to approximate the trace of the influence matrix of Laplacian smoothing splines. In this setting, \(B = A^{-1}\) for a symmetric positive definite (SPD) matrix A and, in turn, B is SPD as well. Other applications, such as spectral density estimation [31], triangle counting in graphs [3, 17], and determinant computation [5], may feature a symmetric but indefinite matrix B. For approximating the determinant, one exploits the relation

$$\begin{aligned} \log (\det (A)) = \,\mathrm {tr}(\log (A)), \end{aligned}$$
(2)

where \(\log (A)\) denotes the matrix logarithm of A. The need for estimating determinants arises, for instance, in statistical learning [2, 18, 20], lattice quantum chromodynamics [39], and Markov random field models [43]. Certain quantities associated with graphs can be formulated as determinants, such as the number of spanning trees, and various negative approximation results exist in this context; see, e.g., [16, 35]. The exact computation of the determinant, which typically relies on a Cholesky factorization, is often infeasible for a large matrix A. In contrast, the Hutchinson estimator combined with (2) bypasses the need for factorizing A and instead requires the (approximate) evaluation of the quadratic form \(x^T \log (A) x\) for several vectors \(x \in {\mathbb {R}}^n\). Compared to the task of estimating the trace of \(A^{-1}\), the determinant computation via (2) is complicated by two issues: (a) Even when A is SPD, the matrix \(B = \log (A)\) may be indefinite; and (b) the quadratic forms \(x^T \log (A) x\) themselves are expensive to compute exactly, so they need to be approximated. We mention in passing that there are other methods to approximate traces and determinants, including randomized subspace iteration [37] and block Krylov methods [30], but they only work well in specific cases, e.g., when \(A = \sigma I + C\) for a matrix C of low numerical rank. The Hutch++ trace estimator, recently proposed and analyzed for the SPD case in [32], overcomes this limitation via a combination with stochastic trace estimation. Although it is not difficult to imagine that the results presented in this work are useful in extending the analysis from [32] to the indefinite case, a thorough discussion of this extension is beyond the scope of this work. Another direction of work on large-scale determinant estimation has explored the use of spectral sparsifiers for symmetric diagonally dominant matrices [16, 26].
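As a quick numerical illustration of relation (2), one can compare \(\,\mathrm {tr}(\log (A))\) with the log-determinant obtained from a Cholesky factorization on a small SPD matrix; the following sketch uses standard SciPy routines and is not meant to be efficient for large A.

```python
import numpy as np
from scipy.linalg import cholesky, logm

rng = np.random.default_rng(1)
n = 100
C = rng.standard_normal((n, n))
A = C @ C.T + n * np.eye(n)              # SPD test matrix

# log(det(A)) via Cholesky: det(A) = prod(diag(L))^2
L = cholesky(A, lower=True)
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

# log(det(A)) via relation (2): tr(log(A))
logdet_trace = np.trace(logm(A)).real

print(logdet_chol, logdet_trace)         # the two values agree up to roundoff
```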

Trace Estimation of Indefinite Matrices. By the central limit theorem, estimate (1) can be expected to become more reliable as N increases; see, e.g., [13, Corollaries 3.3 and 4.3] for such an asymptotic result as \(N \rightarrow \infty \). Most existing non-asymptotic results for trace estimation are specific to an SPD matrix B; see [4, 22, 36] for examples. They provide a bound on the estimated number N of probe vectors to ensure a small relative error with high probability:

$$\begin{aligned} {\mathbb {P}} \left( \left|\frac{\,\mathrm {tr}(B) - \,\mathrm {tr}^{\mathsf {G,R}}_N(B)}{\,\mathrm {tr}(B)} \right|\ge \varepsilon \right) \le \delta ; \end{aligned}$$
(3)

see Remark 2 for a specific example. As already mentioned, the assumption that B is SPD is usually not met when computing the determinant of an SPD matrix A via \(\,\mathrm {tr}(\log (A))\) because this would require all eigenvalues of A to be larger than one. For general indefinite B, it is unrealistic to aim at a bound of form (3) for the relative error, because \(\,\mathrm {tr}(B) = 0\) does not imply zero error. Ubaru, Chen, and Saad [40] derive a bound for the absolute error via rescaling, that is, the results from [36] are applied to the matrix \(C := -\log (\lambda A)\) for a value of \(\lambda > 0\) that ensures C to be SPD. Specifically, for Rademacher vectors it is shown in [40, Corollary 4.5] that

$$\begin{aligned} {\mathbb {P}}\left( | \,\mathrm {tr}^{\mathsf {R}}_N(\log (A)) - \log \det (A) | \ge \varepsilon \right) \le \delta \end{aligned}$$
(4)

is satisfied with fixed failure probability \(\delta \) if the number of samples N grows proportionally to \(\varepsilon ^{-2} n^2 \log (1 + \kappa (A))^2 \log \frac{2}{\delta }\), where \(\kappa (A)\) denotes the condition number of A. Unfortunately, this estimated number of samples compares unfavorably with a much simpler approach: computing the trace from the diagonal elements of \(\log (A)\) only requires the evaluation of n quadratic forms, using all n unit vectors of length n. A more general result for indefinite matrices is shown in [3] and it also features a worst-case dependence on \(n^2\); we refer to Remarks 1 and 5 for a comparison with our new results.

Approximation of Quadratic Forms. To approximate the quadratic forms \({x^T B x} = x^T \log (A) x\), a polynomial approximation of the logarithm can be used, see [24, 34] for approximation by Chebyshev expansion/interpolation and [6, 10, 45] for approximation by Taylor series expansion. Often, a better approximation can be obtained by the Lanczos method, which is equivalent to applying Gaussian quadrature to the integral \(\int \log ({\lambda }) \mathrm{d}\mu ({\lambda })\) on the spectral interval of A, for a suitably defined measure \(\mu \); see [21]. In this case, upper and lower bounds for the quantity \(x^T \log (A) x\) can be determined without much additional effort [5]. Moreover, the convergence of Gaussian quadrature for the quadratic form can be related to the best polynomial approximation of the logarithm on the spectral interval of A; see [40, Theorem 4.2]. By combining the polynomial approximation error with (4), one obtains a total error bound that takes into account both sources of errors. Such a result is presented in [40, Corollary 4.5] for Rademacher vectors; the fact that all such vectors have bounded norm is essential in the analysis.

Contributions. In this paper, we improve the results from [3, 40] by first showing that the number of samples required to achieve (4) is much lower. In particular, we show for a general symmetric matrix B that

$$\begin{aligned} {\mathbb {P}}\big ( | \,\mathrm {tr}^{\mathsf {G,R}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon \big ) \le \delta \end{aligned}$$
(5)

is satisfied with fixed failure probability \(\delta \) if the number of samples N grows proportionally with the stable rank \(\rho (B) := \Vert B \Vert _F^2 / \Vert B \Vert _2^2\); as \(\rho (B) \in [1, n]\), the growth is at most linear in n (instead of quadratic). We derive such a result for both Gaussian and Rademacher vectors, and demonstrate with an explicit example that the dependence on n is asymptotically tight. For SPD matrices B, our bound also improves the state-of-the-art result [36, Theorem 1] for Rademacher vectors by establishing that the number of probe vectors is inversely proportional to the stable rank of \(B^{1/2}\).

Specialized to determinant computation, we combine our results with an improved analysis of the Lanczos method to obtain a sharper total error bound for Rademacher vectors. Finally, we extend this combined error bound to Gaussian vectors, which requires some additional consideration because of the unboundedness of such vectors. We remark that some of our results are potentially of wider interest, beyond stochastic trace and determinant estimation, such as a tail bound for Rademacher chaos (Theorem 2) and an error bound (Corollary 3 combined with Corollary 5) on the polynomial approximation of the logarithm.

We note in passing that some results of this paper also apply to a non-symmetric matrix B, because of the relations \(\,\mathrm {tr}(B) = \,\mathrm {tr}(B_s)\) and \(x^T B x = x^T B_{s} x\) with the symmetric part \(B_s = (B+B^T)/2\).

2 Tail Bounds for Trace Estimates

In this section we derive tail bounds of the form (5) for the stochastic trace estimator applied to a symmetric, possibly indefinite matrix \(B \in {\mathbb {R}}^{n\times n}\). We will analyze Gaussian and Rademacher vectors separately. In the following, we will frequently use a spectral decomposition \(B = Q\Lambda Q^T\), where \(\Lambda = \,\mathrm {diag}(\lambda _1, \ldots , \lambda _n)\) contains the eigenvalues of B and Q is an orthogonal matrix.

2.1 Standard Gaussian Random Vectors

The case of Gaussian vectors will be addressed by using a tail bound for sub-Gamma random variables, which follows from Chernoff bounds; see, e.g., [9].

Definition 1

A random variable X is called sub-Gamma with variance parameter \(\nu > 0\) and scale parameter \(c > 0\) if

$$\begin{aligned} {\mathbb {E}}[\exp (\lambda X)] \le \exp \left( \frac{\nu \lambda ^2}{2(1-c\lambda )} \right) \qquad \text {for all } 0< \lambda < \frac{1}{c}. \end{aligned}$$

Lemma 1

[9, Section 2.4] Let X be a sub-Gamma random variable with parameters \((\nu , c)\). Then, for all \(\varepsilon \ge 0\), we have

$$\begin{aligned} {\mathbb {P}}(X \ge \sqrt{2\varepsilon \nu } + c\varepsilon ) \le \exp (-\varepsilon ). \end{aligned}$$

Lemma 2

[42, Proposition 2.10] Let X be a random variable such that \({\mathbb {E}}[X] = 0\), and such that both X and \(-X\) are sub-Gamma with parameters \((\nu , c)\). Then, for all \(\varepsilon \ge 0\), we have

$$\begin{aligned} {\mathbb {P}} \left( |X| \ge \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{2(\nu + c\varepsilon )} \right) . \end{aligned}$$

Lemma 2 implies the following result for the tail of a single-sample trace estimate. This result is similar, but not identical, to [9, Example 2.12] and [29, Lemma 1], which apply to symmetric matrices with zero diagonal and SPD matrices, respectively.

Lemma 3

For a Gaussian vector X of length n we have

$$\begin{aligned} {\mathbb {P}} \left( |X^T B X - \,\mathrm {tr}(B)| \ge \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{4\Vert B\Vert _F^2 + 4\varepsilon \Vert B\Vert _2} \right) \end{aligned}$$

for all \(\varepsilon >0\).

Proof

We let

$$\begin{aligned} Y := X^T B X - \,\mathrm {tr}(B) = X^T Q \Lambda Q^T X - \,\mathrm {tr}(B) = \sum _{i=1}^n \lambda _i(Z_i^2-1), \end{aligned}$$

where \(Z_i \sim \mathcal {N}(0,1)\) is the ith component of the Gaussian vector \(Q^T X\). To show that Y is sub-Gamma, we define for \(\lambda \in {\mathbb {R}}\) the function

$$\begin{aligned} \psi (\lambda ) := \log {\mathbb {E}} [ \exp (\lambda (Z^2-1))], \quad Z \sim \mathcal {N}(0,1). \end{aligned}$$

By direct computation, it follows that \(\psi (\lambda ) = -\lambda - \frac{1}{2} \log (1-2\lambda )\) for \(\lambda <\frac{1}{2}\). In particular, this implies \(\psi (\lambda ) \le \frac{\lambda ^2}{1-2\lambda }\) for \(0 \le \lambda < \frac{1}{2}\), and \(\psi (\lambda ) \le \lambda ^2 \le \frac{\lambda ^2}{1 + c\lambda }\) for \(-\frac{1}{c}< \lambda < 0\) for all \(c > 0\). Using the independence of \(Z_i\) for different i we obtain

$$\begin{aligned} \log {\mathbb {E}}[\exp (\lambda Y)]&= \sum _{i=1}^n \log {\mathbb {E}} [\exp (\lambda \lambda _i(Z_i^2-1))] = \sum _{i=1}^n \psi (\lambda \lambda _i) \\&\le \sum _{i=1}^n \frac{\lambda _i^2 \lambda ^2}{1-2|\lambda _i| \lambda } \le \frac{\Vert B \Vert _F^2 \lambda ^2}{1-2\Vert B \Vert _2 \lambda } \end{aligned}$$

for \(0< \lambda < \frac{1}{2\Vert B \Vert _2}\). This shows that Y is sub-Gamma with parameters \((\nu ,c) = (2\Vert B \Vert _F^2, 2\Vert B \Vert _2)\). Moreover, \(-Y = X^T (-B) X - \,\mathrm {tr}(-B)\) is also sub-Gamma with the same parameters. Because \({\mathbb {E}}[Y]=0\), Lemma 2 implies the desired result.\(\square \)

A diagonal embedding trick turns Lemma 3 into a tail bound for the stochastic trace estimator (1).

Theorem 1

Let \(B\in {\mathbb {R}}^{n\times n}\) be symmetric. Then

$$\begin{aligned} {\mathbb {P}}\left( | \,\mathrm {tr}^{\mathsf {G}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon \right) \le 2\exp \left( -\frac{N\varepsilon ^2}{4 \Vert B \Vert _F^2 + 4 \varepsilon \Vert B \Vert _2}\right) \end{aligned}$$

for all \(\varepsilon > 0\). In particular, for \(N \ge \frac{4}{\varepsilon ^2}(\Vert B \Vert _F^2 + \varepsilon \Vert B \Vert _2) \log \frac{2}{\delta }\) it holds that \({\mathbb {P}}(| \,\mathrm {tr}^{\mathsf {G}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon ) \le \delta \).

Proof

We apply Lemma 3 to the matrix

$$\begin{aligned} \mathcal {B} := \,\mathrm {diag}\big ( N^{-1} B,\ldots , N^{-1} B\big ) \in {\mathbb {R}}^{Nn\times Nn}, \end{aligned}$$
(6)

that is, the block diagonal matrix with the N diagonal blocks containing rescaled copies of B. In turn, the trace estimate (1) equals \(X^T \mathcal {B} X\) for a Gaussian vector X of length Nn. Noting that \(\Vert \mathcal {B} \Vert _F = N^{-1/2} \Vert B \Vert _F\) and \(\Vert \mathcal {B} \Vert _2 = N^{-1} \Vert B \Vert _2\), the first part of the theorem follows from Lemma 3. Setting

$$\begin{aligned} \delta := 2\exp \left( -\frac{\varepsilon ^2}{4 \Vert \mathcal {B} \Vert _F^2 + 4 \varepsilon \Vert \mathcal {B} \Vert _2}\right) = 2\exp \left( -\frac{N\varepsilon ^2}{4 \Vert B \Vert _F^2 + 4 \varepsilon \Vert B \Vert _2}\right) \end{aligned}$$

we obtain \( N = \frac{4}{\varepsilon ^2} \left( \Vert B \Vert _F^2 + \varepsilon \Vert B \Vert _2 \right) \log \frac{2}{\delta }. \) \(\square \)

Remark 1

The result of Theorem 1 compares favorably with Lemma 4 in [3], which shows that \({\mathbb {P}}(| \,\mathrm {tr}^{\mathsf {G}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon ) \le \delta \) for \(N \ge \frac{20}{\varepsilon ^2} \Vert B \Vert _*^2 \log \frac{4}{\delta }\). Because of \(\Vert B \Vert _F \le \Vert B \Vert _* \le \sqrt{n} \Vert B \Vert _F\), the bound of Theorem 1 is always better for reasonable values of \(\varepsilon \), and it can improve the estimated number of samples N in [3] by a factor proportional to n.

We recall that the stable rank of B is defined as \(\rho = \Vert B \Vert _F^2 / \Vert B \Vert _2^2\) and satisfies \(\rho \in [1,n]\). In particular, \(\rho (B) = 1\) when B has rank one and \(\rho (B) = n\) when all singular values are equal. Intuitively, \(\rho (B)\) tends to be large when B has many singular values not significantly smaller than the largest one. The minimum number of probe vectors required by Theorem 1 depends on the stable rank of B in the following way:

$$\begin{aligned} \frac{4}{\varepsilon ^2}(\rho \Vert B \Vert _2^2 + \varepsilon \Vert B \Vert _2) \log \frac{2}{\delta } \le \frac{4}{\varepsilon ^2}(n\Vert B \Vert _2^2 + \varepsilon \Vert B \Vert _2) \log \frac{2}{\delta }. \end{aligned}$$

The upper bound indicates that N may need to be chosen proportionally with n to reach a fixed (absolute) accuracy \(\varepsilon \) with constant success probability, provided that \(\Vert B \Vert _2\) remains constant as well. The following lemma shows for a simple matrix B that such a linear growth of N cannot be avoided.

Lemma 4

Let n be even and consider the traceless matrix \(B = \left[ \begin{array}{cc} I_{\frac{n}{2}} &{} 0 \\ 0 &{} -I_{\frac{n}{2}} \end{array}\right] \). Then, for every \(\varepsilon > 0\), it holds that

$$\begin{aligned} {\mathbb {P}}(| \,\mathrm {tr}^{\mathsf {G}}_N(B)| \le \varepsilon ) \le \varepsilon \sqrt{\frac{N}{\pi n}}. \end{aligned}$$

Proof

By the definition of B, the trace estimate takes the form

$$\begin{aligned} \,\mathrm {tr}^{\mathsf {G}}_N(B) = \frac{1}{N} \Big ( \sum _{i=1}^{nN/2} X_i^2 - \sum _{j=1}^{nN/2} Y_j^2 \Big ) \end{aligned}$$

for independent \(X_i, Y_j \sim \mathcal {N}(0,1)\). In other words,

$$\begin{aligned} N \cdot \,\mathrm {tr}^{\mathsf {G}}_N(B) = X - Y, \end{aligned}$$

where X, Y are independent Chi-squared random variables with \(\frac{nN}{2}\) degrees of freedom. The probability density function f of \(Z = X-Y\) can be expressed as

$$\begin{aligned} f(z) = \frac{1}{2^{nN/4} \sqrt{\pi } \Gamma (nN/4)} |z|^{\frac{nN}{4}-\frac{1}{2}} K_{\frac{nN}{4}-\frac{1}{2}}(|z|), \end{aligned}$$

where \(K_{\frac{nN}{4}-\frac{1}{2}}\) is a modified Bessel function of the second kind [15]. In particular,

$$\begin{aligned} f(0) = \frac{1}{4\sqrt{\pi }} \frac{\Gamma \left( \frac{nN}{4} - \frac{1}{2} \right) }{\Gamma \left( \frac{nN}{4} \right) } = \frac{1}{4\sqrt{\pi }} \frac{\sqrt{\pi } }{2^{\frac{nN}{2}-2}}\left( {\begin{array}{c}\frac{nN}{2}-2\\ \frac{nN}{4}-1\end{array}}\right) \le \frac{1}{2\sqrt{\pi nN}}, \end{aligned}$$

where we used the duplication formula for Gamma functions and the inequality \(\frac{1}{2^{2k}}\left( {\begin{array}{c}2k\\ k\end{array}}\right) \le \frac{1}{\sqrt{\pi k}}\); see [41].

As f is an autocorrelation function (of the density function of a Chi-squared variable with nN/2 degrees of freedom), its maximum is at 0. We can therefore estimate the probability of \(X-Y\) being in the interval \([-N\varepsilon , N\varepsilon ]\) in the following way:

$$\begin{aligned} {\mathbb {P}}(|\,\mathrm {tr}^{\mathsf {G}}_N(B)| \le \varepsilon ) = {\mathbb {P}}(|X-Y| \le N\varepsilon ) \le 2N\varepsilon f(0) \le \varepsilon \sqrt{\frac{N}{\pi n}}. \end{aligned}$$

\(\square \)

We can reformulate Theorem 1 in such a way that, given a number N of probe vectors and a failure probability \(\delta \in (0,1)\), we have \(\varepsilon = \varepsilon (B, N, \delta )\) such that with probability at least \(1-\delta \) one has \(\,\mathrm {tr}^{\mathsf {G}}_N(B) \in [\,\mathrm {tr}(B) - \varepsilon , \,\mathrm {tr}(B) + \varepsilon ]\). The random variable \(X^T \mathcal {B} X - \,\mathrm {tr}({\mathcal {B}})\), where \(\mathcal {B}\) is defined as in (6) and X is a Gaussian vector of length nN, is sub-Gamma with parameters \(\left( 2\frac{\Vert B\Vert _F^2}{N}, 2\frac{\Vert B\Vert _2}{N} \right) \), and the same holds for \(-X^T \mathcal {B} X\). By Lemma 1 we have

$$\begin{aligned} \varepsilon \equiv \varepsilon (B, \delta , N) = \frac{2}{\sqrt{N}} \Vert B \Vert _F \sqrt{\log \frac{2}{\delta }} + \frac{2}{N} \Vert B \Vert _2 \log \frac{2}{\delta } \le \Big ( 2 \sqrt{\frac{n}{N}\log \frac{2}{\delta }} + \frac{2}{N} \log \frac{2}{\delta }\Big ) \Vert B \Vert _2. \end{aligned}$$
(7)

As the example in Lemma 4 shows, the potential growth of \(\varepsilon \) with \(\sqrt{n}\) cannot be avoided in general. Figure 1 illustrates this growth. In the case of relative error estimates for symmetric positive semidefinite (SPSD) matrices, it is shown in [44] that the dependence on \(\log \frac{2}{\delta }\) and \(\frac{1}{\varepsilon ^2}\) cannot be improved.
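For concreteness, the bound (7) and the sufficient sample size from Theorem 1 can be evaluated as follows; the helper names are ours and the norms of B are assumed to be known (or estimated).

```python
import numpy as np

def eps_bound(norm_F, norm_2, delta, N):
    """Error level eps(B, delta, N) from (7) for Gaussian probe vectors."""
    t = np.log(2.0 / delta)
    return 2.0 * norm_F * np.sqrt(t / N) + 2.0 * norm_2 * t / N

def samples_needed(norm_F, norm_2, delta, eps):
    """Sufficient N from Theorem 1: 4/eps^2 (||B||_F^2 + eps ||B||_2) log(2/delta)."""
    return int(np.ceil(4.0 / eps**2 * (norm_F**2 + eps * norm_2) * np.log(2.0 / delta)))

# The matrix from Lemma 4 has ||B||_2 = 1 and ||B||_F = sqrt(n).
n, delta = 1024, 0.01
print(eps_bound(np.sqrt(n), 1.0, delta, N=10))
print(samples_needed(np.sqrt(n), 1.0, delta, eps=1.0))
```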

Fig. 1  Asterisks: Errors \(|\,\mathrm {tr}^{\mathsf {G}}_{10}(B) - \,\mathrm {tr}(B)|\) of 100 samples for each \(n = 2^k\) with \(k = 2, \ldots , 23\), where B is the matrix from Lemma 4. Blue line: Error bound \(\varepsilon (B,0.01,10)\) from (7)

Remark 2

For a nonzero SPSD matrix B, the result of Theorem 1 can be turned into a relative error estimate. Let \(\mu := \Vert B \Vert _2 / \,\mathrm {tr}(B) = \rho (B^{1/2})^{-1}\). Replacing \(\varepsilon \) by \(\varepsilon \cdot \,\mathrm {tr}(B)\) in Theorem 1 and noting that \(\Vert B \Vert _F^2 / \,\mathrm {tr}(B)^2 \le \mu \), one obtains

$$\begin{aligned} {\mathbb {P}} \left( \frac{|\,\mathrm {tr}^{\mathsf {G}}_N(B)-\,\mathrm {tr}(B)|}{\,\mathrm {tr}(B)} \ge \varepsilon \right) \le \delta \quad \text { for } N \ge \frac{4}{\varepsilon ^2}(1 + \varepsilon )\mu \log \frac{2}{\delta }. \end{aligned}$$

State-of-the-art results of a similar form are Theorem 3 in [36], which requires \(N \ge \frac{8}{\varepsilon ^2} \mu \log \frac{2}{\delta }\), and Corollary 3.3 in [22], which requires \(N \ge \frac{2}{\varepsilon ^2} {\mu } \log \frac{2}{\delta }\) and \(\varepsilon \in \left( 0, \frac{1}{2}\right) \). Compared to [22], our result imposes no restriction on \(\varepsilon \) at the expense of a somewhat larger constant. On the other hand, for \(\varepsilon \le 1\), our result is always more favorable than the result from [36] for SPSD matrices.

2.2 Rademacher Random Vectors

The quadratic form \(X^T B X\) for a Rademacher vector X is called Rademacher chaos of order 2. We will first consider the homogeneous case, corresponding to a matrix B with zero diagonal, which has been studied extensively in the literature; see, e.g., [9, 19, 25, 28, 38]. The non-homogeneous case is easily obtained from the homogeneous case; see Corollary 1. We make use of the entropy method [9] to establish the following tail bound for a single-sample trace estimate.

Theorem 2

Let X be a Rademacher vector of length n and let B be a nonzero symmetric matrix such that \(B_{ii} = 0\) for \(i=1, \ldots , n\). Then, for all \(\varepsilon > 0\),

$$\begin{aligned} {\mathbb {P}} \left( | X^T B X | \ge \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{8\Vert B\Vert _F^2 + 8\varepsilon \Vert B\Vert _2 } \right) . \end{aligned}$$
(8)

Proof

The proof follows closely [1, Theorem 6] and [9, Theorem 17]; see Remark 3 for a comparison with these results. The main idea of the proof is as follows. Using the logarithmic Sobolev inequalities discussed in the appendix, a bound on the entropy of the random variable \(X^T B X\) is obtained. Using a (modified) Herbst argument, we derive a bound on the moment generating function (MGF) of \(X^T B X\), establishing that it is sub-Gamma with certain constants, which then allows us to apply Lemma 2.

Without loss of generality, we may assume \(\Vert B \Vert _2 = 1\); the general case follows from applying the result to \({\tilde{B}} := B / \Vert B\Vert _2\). Let us consider the function \(f:\{-1,1\}^n \rightarrow {\mathbb {R}}\) defined as

$$\begin{aligned} f(x) = x^T B x = \sum _{i \ne j} x_i x_j B_{ij}. \end{aligned}$$

We want to apply the logarithmic Sobolev inequality (20) from Theorem 6 to f(X). For this purpose, we let

$$\begin{aligned} \bar{X}^{(i)} = \big [ X_1, \ldots , X_{i-1}, -X_i, X_{i+1}, \ldots , X_n \big ]^T = X - 2 X_i e_i, \quad i=1, \ldots , n, \end{aligned}$$

where \(e_i\) denotes the ith unit vector. Using that B has zero diagonal entries, we obtain

$$\begin{aligned} f(X) - f(\bar{X}^{(i)}) = \langle B X, X\rangle - \langle BX - 2 X_i B e_i, X - 2 X_i e_i \rangle = 4 X_i \langle B e_i, X \rangle . \end{aligned}$$

Therefore, denoting

$$\begin{aligned} Y := \Vert BX\Vert _2^2 = \frac{1}{16} \sum _{i=1}^n \big ( f(X) - f(\bar{X}^{(i)}) \big )^2 , \end{aligned}$$

and letting \({\mathbb {H}}(Z)\) denote the entropy of a random variable Z, Theorem 6 establishes, for all \(\lambda > 0\),

$$\begin{aligned} \,{\mathbb {H}}(\exp (\lambda f(X))) \le 2\lambda ^2 {\mathbb {E}} \left[ Y \exp (\lambda f(X)) \right] . \end{aligned}$$
(9)

The decoupling inequality in [19, Lemma 8.50], which follows from Jensen’s inequality, gives

$$\begin{aligned} \lambda {\mathbb {E}}[Y\exp (\lambda f(X))] \le \,{\mathbb {H}}(\exp (\lambda f(X))) + {\mathbb {E}}[\exp (\lambda f(X))]\log {\mathbb {E}}[\exp (\lambda Y)]. \end{aligned}$$

Combined with (9), this implies

$$\begin{aligned} \,{\mathbb {H}}(\exp (\lambda f(X))) \le \frac{2\lambda }{1-2\lambda } {\mathbb {E}}[\exp (\lambda f(X))] \cdot \log {\mathbb {E}}[\exp (\lambda Y )] \text { for } 0< \lambda < \frac{1}{2}. \end{aligned}$$
(10)

To find an upper bound on the MGF of Y, we again use a logarithmic Sobolev inequality and then transform the obtained bound on the entropy into a bound on the MGF by a Herbst argument. We do so by applying inequality (19) from Theorem 6 to the function \(h: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) defined by \(h(x) := \Vert B x \Vert _2^2\). For this purpose, note that

$$\begin{aligned} h(X) - h(\bar{X}^{(i)})= & {} \langle B X, BX \rangle - \langle B \bar{X}^{(i)}, B \bar{X}^{(i)} \rangle = \langle B (X- \bar{X}^{(i)}), B (X+ \bar{X}^{(i)}) \rangle \\= & {} 4 \langle X_i B e_i, BX - X_i B e_i \rangle \le 4 X_i \langle B e_i, BX \rangle \end{aligned}$$

and, hence,

$$\begin{aligned} \sum _{i=1}^n \left( h(X) - h(\bar{X}^{(i)})\right) _+^2 \le 16 \sum _{i=1}^n \langle B e_i, BX \rangle ^2 = 16 \Vert B^T BX\Vert _2^2 \le 16 \Vert BX\Vert _2^2. \end{aligned}$$

Therefore, Theorem 6 gives

$$\begin{aligned} \,{\mathbb {H}}(\exp (\lambda Y)) \le 4 \lambda ^2 {\mathbb {E}} [Y \exp ( \lambda Y)]. \end{aligned}$$

Letting \(g(\lambda ) := 4{\mathbb {E}}[Y \exp (\lambda Y)] /{\mathbb {E}}[ \exp (\lambda Y)]\), we have obtained a bound of form (18), as required by Lemma 7. Note that \(g(\lambda ) = 4 \psi '(\lambda )\), where \(\psi (\lambda ) := \log {\mathbb {E}}[\exp (\lambda Y)]\). The result of Lemma 7 gives

$$\begin{aligned} \log {\mathbb {E}}[\exp (\lambda Y)] \le \frac{\lambda }{1-4\lambda } \Vert B \Vert _F^2 \,\,\text { for }\lambda \in \left( 0, \frac{1}{4}\right) . \end{aligned}$$

Inserting this inequality into (10) gives

$$\begin{aligned} \,{\mathbb {H}}(\exp (\lambda f(X))) \le \frac{2\lambda ^2 \Vert B \Vert _F^2}{(1-4\lambda )(1-2\lambda )} {\mathbb {E}}[\exp (\lambda f(X))] \,\,\text { for }\lambda \in \left( 0, \frac{1}{4}\right) . \end{aligned}$$

The random variable f(X) satisfies (18) for the function \(g(\lambda ) := \frac{2\Vert B\Vert _F^2}{(1-4\lambda )(1-2\lambda )}\) in the interval [0, 1/4). Recalling that \({\mathbb {E}}[f(X)]=0\), the result of Lemma 7 gives

$$\begin{aligned} \log {\mathbb {E}}[\exp (\lambda f(X))] \le \lambda \Vert B \Vert _F^2 \log \frac{1-2\lambda }{1-4\lambda } \le \frac{2\Vert B \Vert _F^2 \lambda ^2}{1-4\lambda }, \quad \lambda \in [0, 1/4), \end{aligned}$$

where we used \(\log (1+x) \le x\) in the last inequality.

Replacing f by \(-f\) and B by \(-B\), we also obtain

$$\begin{aligned} \log {\mathbb {E}}[\exp (- \lambda f(X))] \le \frac{2\Vert B \Vert _F^2 \lambda ^2}{1-4\lambda }, \quad \lambda \in [0, 1/4). \end{aligned}$$

Therefore, the random variables f(X) and \(-f(X)\) are sub-Gamma with parameters \(( 4\Vert B\Vert _F^2, 4)\). Applying Lemma 2 concludes the proof.\(\square \)

Remark 3

The proof of Theorem 2 follows the proof of [1, Theorem 6], which in turn refines a result from [8, Theorem 17] (see also [9]) by substituting the more general logarithmic Sobolev inequality from [8, Proposition 10] with the ones from Theorem 6 specific for Rademacher random variables. However, let us stress that the results in [1, 8] feature larger constants partly because they deal with the more general Rademacher chaos

$$\begin{aligned} f(X) = \sup _{B \in \mathcal {B}} \sum _{i\ne j} X_i X_j B_{ij}, \end{aligned}$$
(11)

where \(\mathcal {B}\) is a set of symmetric matrices with zero diagonal. Restricted to the case \(\mathcal {B} = \{B\}\), the results stated in [1, Theorem 6] and [9, Exercise 6.9] give \({\mathbb {P}} \left( | X^T B X | \ge \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{16\Vert B\Vert _F^2 + 16 \Vert B\Vert _2 \varepsilon } \right) \) and \({\mathbb {P}} \left( | X^T B X | \ge \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{32\Vert B\Vert _F^2 + 128 \Vert B\Vert _2 \varepsilon } \right) \), respectively. Proposition 8.13 in [19] states \({\mathbb {P}} \left( | X^T B X | \ge \varepsilon \right) \le 2\exp \left( -\min \left\{ \frac{3\varepsilon ^2}{128 \Vert {B} \Vert _F^2}, \frac{\varepsilon }{32 \Vert {B} \Vert _2} \right\} \right) \).

As for Gaussian vectors, the result of Theorem 2 can be turned into a tail bound for \(\,\mathrm {tr}^{\mathsf {R}}_N(B)\) by block diagonal embedding. In the following, let \(\,\mathrm {D}_B\) denote the diagonal matrix containing the diagonal entries of B.

Corollary 1

Let B be a nonzero symmetric matrix. Then

$$\begin{aligned} {\mathbb {P}}\big ( | \,\mathrm {tr}^{\mathsf {R}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon \big ) \le 2\exp \left( -\frac{N\varepsilon ^2}{8 \Vert B - \,\mathrm {D}_B \Vert _F^2 + 8 \varepsilon \Vert B - \,\mathrm {D}_B \Vert _2}\right) \end{aligned}$$

for every \(\varepsilon > 0\). In particular, for

$$\begin{aligned} N \ge \frac{8}{\varepsilon ^2} \left( \Vert B - \,\mathrm {D}_B \Vert _F^2 + \varepsilon \Vert B - \,\mathrm {D}_B \Vert _2 \right) \log \frac{2}{\delta } \end{aligned}$$

it holds that \({\mathbb {P}}\big (| \,\mathrm {tr}^{\mathsf {R}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon \big ) \le \delta \).

Proof

Let \(C := B - \,\mathrm {D}_B\) and \(\mathcal {C} := \,\mathrm {diag}\big ( N^{-1} C,\ldots , N^{-1} C\big ) \in {\mathbb {R}}^{Nn\times Nn}\). Then, \(\,\mathrm {tr}^{\mathsf {R}}_N(B) - \,\mathrm {tr}(B) = X^T \mathcal {C} X\) for a Rademacher vector X of length Nn.

The matrix \(\mathcal {C}\) has zero diagonal, \(\Vert \mathcal {C} \Vert _F = N^{-1/2} {\Vert C \Vert _F}\), and \(\Vert \mathcal {C} \Vert _2 = N^{-1} \Vert C\Vert _2\). Now, the first part of the corollary directly follows from Theorem 2. Imposing a failure probability of \(\delta \) in (8) gives

$$\begin{aligned} \delta := 2\exp \left( -\frac{\varepsilon ^2}{8 \Vert \mathcal {C} \Vert _F^2 + 8 \varepsilon \Vert \mathcal {C} \Vert _2}\right) = 2\exp \left( -\frac{N\varepsilon ^2}{8 \Vert C \Vert _F^2 + 8 \varepsilon \Vert C \Vert _2}\right) , \end{aligned}$$

and hence \( N = \frac{8}{\varepsilon ^2} \left( \Vert C \Vert _F^2 + \varepsilon \Vert C \Vert _2 \right) \log \frac{2}{\delta } \). \(\square \)

Remark 4

It is instructive to compare the result of Corollary 1 to the straightforward application of Bernstein’s inequality, which gives

$$\begin{aligned} {\mathbb {P}}\left( | \,\mathrm {tr}^{\mathsf {R}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon \right) \le 2\exp \left( -\frac{N\varepsilon ^2}{4 \Vert B - \,\mathrm {D}_B \Vert _F^2 + \frac{4}{3} n \varepsilon \Vert B - \,\mathrm {D}_B \Vert _2}\right) . \end{aligned}$$

Clearly, a disadvantage of this bound is the explicit dependence of the denominator on n, which does not appear in Corollary 1.

An alternative expression for the lower bound on N is obtained by noting that \(\Vert B - \,\mathrm {D}_B\Vert _F \le \Vert B\Vert _F\) and \(\Vert B - \,\mathrm {D}_B\Vert _2 \le 2 \Vert B \Vert _2\) (the factor 2 in the latter inequality is asymptotically tight, see, e.g., [7]). The result of Corollary 1 thus states that it suffices to choose N at least as large as

$$\begin{aligned} \frac{8}{\varepsilon ^2}(\rho \Vert B \Vert _2^2 + 2 \varepsilon \Vert B \Vert _2) \log \frac{2}{\delta } \le \frac{8}{\varepsilon ^2}(n \Vert B \Vert _2^2 + 2 \varepsilon \Vert B \Vert _2) \log \frac{2}{\delta }, \end{aligned}$$

where \(\rho \) is the stable rank of B.

Remark 5

In analogy to the Gaussian case (see Remark 1), the result of Corollary 1 compares favorably with Lemma 5 in [3], which shows that \({\mathbb {P}}(| \,\mathrm {tr}^{\mathsf {R}}_N(B) - \,\mathrm {tr}(B) | \ge \varepsilon ) \le \delta \) for \(N \ge \frac{6}{\varepsilon ^2} \Vert B \Vert _*^2 \log \frac{2\cdot \mathrm {rank}(B)}{\delta }\).

In analogy to the Gaussian case, the following lemma shows that a potential linear dependence of N on n cannot be avoided in general.

Lemma 5

Let n be even and consider the traceless matrix \(B = \left[ \begin{array}{cc} 0 &{} I_{\frac{n}{2}} \\ I_{\frac{n}{2}} &{} 0 \end{array}\right] \). Then

$$\begin{aligned} {\mathbb {P}} \left( |\,\mathrm {tr}^{\mathsf {R}}_N(B)| \le \varepsilon \right) \le \varepsilon \sqrt{\frac{N}{\pi n}} \end{aligned}$$

for every \(\varepsilon > 0\).

Proof

We first note that \(\,\mathrm {tr}^{\mathsf {R}}_N(B) = \frac{2}{N} \sum _{i=1}^{nN/2} Z_i\) with independent Rademacher random variables \(Z_i\). In turn, \( {\mathbb {P}} \left( |\,\mathrm {tr}^{\mathsf {R}}_N(B)| \le \varepsilon \right) = {\mathbb {P}} \left( \left|\sum _{i=1}^{nN/2} Z_i \right|\le \frac{N\varepsilon }{2} \right) \) equals the probability that the number of variables satisfying \(Z_i = 1\) is at least \(\frac{n-\varepsilon }{4}N\) and at most \(\frac{n+\varepsilon }{4}N\). Therefore,

$$\begin{aligned} {\mathbb {P}} \big ( |\,\mathrm {tr}^{\mathsf {R}}_N(B)| \le \varepsilon \big )&= \frac{1}{2^{nN/2}} \sum _{i=\lceil \frac{n-\varepsilon }{4}N\rceil }^{\lfloor \frac{n + \varepsilon }{4}N\rfloor } \left( {\begin{array}{c}nN/2\\ i\end{array}}\right) \le \frac{N\varepsilon }{2} \cdot \frac{1}{2^{nN/2}}\cdot \left( {\begin{array}{c}nN/2\\ nN/4\end{array}}\right) \\&\le \frac{N\varepsilon }{2} \cdot \frac{2}{\sqrt{\pi nN}} = \varepsilon \sqrt{\frac{N}{\pi n}}, \end{aligned}$$

where we used the inequality \(\frac{1}{2^{2k}}\left( {\begin{array}{c}2k\\ k\end{array}}\right) \le \frac{1}{\sqrt{\pi k}}\). \(\square \)

We do not report a figure analogous to Fig. 1 because the observed errors are very similar to the Gaussian case.

For SPSD matrices, a relative error estimate follows from Corollary 1 similarly to what has been discussed in Remark 2 for Gaussian vectors. We recall that \(\rho = \Vert B \Vert _F^2 / \Vert B \Vert _2^2\) denotes the stable rank of B.

Corollary 2

For a nonzero SPSD matrix B, we have

$$\begin{aligned} {\mathbb {P}} \left( \frac{|\,\mathrm {tr}^{\mathsf {R}}_N(B)-\,\mathrm {tr}(B)|}{\,\mathrm {tr}(B)} \ge \varepsilon \right) \le \delta \quad \text {for} \quad N \ge \frac{8}{\varepsilon ^2}(1 + \varepsilon ){\mu } \log \frac{2}{\delta }, \quad {\text {where} \quad \mu := \frac{\Vert B\Vert _2}{\,\mathrm {tr}(B)}}. \end{aligned}$$

Proof

First of all, it is immediate that \(\Vert B - \,\mathrm {D}_B \Vert _F \le \Vert B \Vert _F\). As shown, e.g., in [7, Theorem 4.1], the same holds for the spectral norm when B is SPSD. For convenience, we provide a short proof: For every \(y\in {\mathbb {R}}^n\) it holds that

$$\begin{aligned} | y^T (B - \,\mathrm {D}_B) y| \le \max \{y^T B y, y^T \,\mathrm {D}_B y\} \le \max \{ \Vert B \Vert _2, \Vert \,\mathrm {D}_B \Vert _2 \} \le \Vert B \Vert _2, \end{aligned}$$

where the first inequality uses that both \(y^T B y\) and \(y^T \,\mathrm {D}_B y\) are nonnegative. By taking the maximum with respect to all vectors of norm 1 one obtains \(\Vert B - \,\mathrm {D}_B \Vert _2\) on the left-hand side, which shows that it is bounded by \(\Vert B \Vert _2\). Now, the claimed result follows from Corollary 1 using the arguments of Remark 2.\(\square \)

Corollary 2 improves the result from [36, Theorem 1], which requires \(N \ge \frac{6}{\varepsilon ^2} \log \frac{2}{\delta }\), a lower bound that does not improve as \(\mu \) decreases.

3 Lanczos Method to Approximate Quadratic Forms

Let us now consider the problem of estimating the log determinant through \(\log (\det (A)) = \,\mathrm {tr}(\log (A))\), or more generally the problem of computing the trace of f(A) for an analytic function f.

Applying a trace estimator to \(\,\mathrm {tr}(f(A))\) requires the (approximate) computation of the quadratic forms \(x^T f(A) x\) for fixed vectors \(x \in {\mathbb {R}}^n\). Following [40], we use the Lanczos method, Algorithm 1, for this purpose.

Algorithm 1  Lanczos method for approximating the quadratic form \(x^T f(A) x\)
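Since Algorithm 1 is the standard Lanczos method, we give only a brief Python/NumPy sketch of it here; the sketch omits the reorthogonalization and breakdown handling a practical implementation would include, and the function name is our own.

```python
import numpy as np

def lanczos_quadform(matvec, x, m, f=np.log):
    """Approximate x^T f(A) x by m steps of Lanczos started at x / ||x||_2.

    matvec : callable returning A @ v for a symmetric matrix A (SPD when f = log).
    Returns ||x||_2^2 * e_1^T f(T_m) e_1, which coincides with the m-point
    Gauss quadrature approximation of the integral representing x^T f(A) x.
    """
    nrm = np.linalg.norm(x)
    q = x / nrm
    q_prev = np.zeros_like(q)
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    b = 0.0
    for j in range(m):
        w = matvec(q) - b * q_prev
        alpha[j] = q @ w
        w = w - alpha[j] * q              # three-term recurrence (no reorthogonalization)
        if j < m - 1:
            b = np.linalg.norm(w)         # breakdown (b == 0) not handled in this sketch
            beta[j] = b
            q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    theta, U = np.linalg.eigh(T)          # Ritz values = Gauss quadrature nodes
    return nrm**2 * np.sum(U[0, :]**2 * f(theta))
```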

For theoretical considerations, it is helpful to view the quadratic form as an integral. For this purpose, we consider the spectral decomposition \(A = Q\Lambda Q^T\), \(\Lambda = \,\mathrm {diag}(\lambda _1, \ldots , \lambda _n)\), with \(\lambda _{\min } = \lambda _1 \le \lambda _2 \le \cdots \le \lambda _n = \lambda _{\max }\). Then

$$\begin{aligned} x^T f(A) x = I:= \int _{\lambda _{\min }}^{\lambda _{\max }} f(\lambda ) \,\text {d}\mu (\lambda ), \end{aligned}$$

with the piecewise constant measure

$$\begin{aligned} \mu (\lambda ) := \sum _{i=1}^n z_i^2 \chi _{[\lambda _i,\infty )}(\lambda ),\quad z := Q^T x, \end{aligned}$$
(12)

where \(\chi \) denotes the indicator function. It is well-known [21, Theorem 6.2] that the approximation \(I_m\) returned by the m-point Gaussian quadrature rule applied to I is identical to the approximation returned by m steps of the Lanczos method:

$$\begin{aligned} I_m := \Vert x \Vert _2^2 \cdot {e_1^T f(T_m) e_1}. \end{aligned}$$

To bound the error \(|I - I_m|\), the analysis in [40] proceeds by using existing results on the polynomial approximation error of analytic functions. Although our analysis is along the same lines, it differs in a key technical aspect: we derive and use an improved error bound for the approximation of the logarithm; see Corollary 4. We have also noted two minor errata in [40]; see the proof of Theorem 3 and the remark after Corollary 3 for details.

Theorem 3

Let \(f:[-1,1] \rightarrow {\mathbb {R}}\) admit an analytic continuation to a Bernstein ellipse \(E_{\rho _0}\) with foci \(\pm 1\) and elliptical radius \(\rho _0\). For \(1< \rho < \rho _0\), let \(M_{\rho }\) be the maximum of |f(z)| on \(E_{\rho }\). Then

$$\begin{aligned} | I - I_m | \le \Vert x \Vert _2^2 \cdot \frac{4 M_{\rho }}{1-\rho ^{-1}} \rho ^{-2m}. \end{aligned}$$

Proof

As in [40], this result follows directly from bounds on the polynomial approximation error of analytic functions via Chebyshev expansion, combined with the fact that m-point Gaussian quadrature is exact for polynomials up to degree \(2m-1\). However, the proof of [40, Theorem 4.2] uses an extra ingredient, which seems to be wrong. It claims that the integration error for odd-degree Chebyshev polynomials is zero thanks to symmetry. While this fact is indeed true for the standard Lebesgue measure, it does not hold for the measure (12). In turn, one obtains the slightly worse factor \(1-\rho ^{-1}\) in the denominator, compared to the factor \(1-\rho ^{-2}\) that would have been obtained from [40, Theorem 4.2] translated into our setting. \(\square \)

The affine linear transformation

$$\begin{aligned} \varphi :[\lambda _{\min }, \lambda _{\max }] \rightarrow [-1, 1], \qquad t \mapsto \frac{2}{\lambda _{\max } - \lambda _{\min }}t - \frac{\lambda _{\max } + \lambda _{\min }}{\lambda _{\max } - \lambda _{\min }}, \end{aligned}$$

is used to map an interval \([\lambda _{\min }, \lambda _{\max }]\) containing the eigenvalues of A to the interval \([-1,1]\) of Theorem 3. Defining \(g:= f \circ \varphi ^{-1}\), one has

$$\begin{aligned} x^T g(\varphi (A)) x = x^T f(A) x, \quad e_1^T g(\varphi (T_m))e_1 = e_1^T f(T_m) e_1. \end{aligned}$$
(13)

By its shift and scaling invariance, the Lanczos method with g, \(\varphi (A)\), and x returns the approximation \(e_1^T g(\varphi (T_m))e_1\). This allows us to apply Theorem 3. Combined with relations (13), the following result is obtained.

Corollary 3

With the notation introduced above, it holds that

$$\begin{aligned} \left|x^T f(A) x - \Vert x\Vert _2^2 \cdot e_1^T f(T_m) e_1 \right|\le \Vert x \Vert _2^2 \cdot \frac{4 M_{\rho }}{1-\rho ^{-1}} \rho ^{-2m}, \end{aligned}$$

where \(M_\rho \) is the maximum of |g| on \(E_\rho \), which is equal to the maximum of |f| on the transformed ellipse with foci \(\lambda _{\min }, \lambda _{\max }\) and elliptical radius \((\lambda _{\max }-\lambda _{\min }) \rho /2\). The result of Corollary 3 differs from the corresponding result in [40, page 1087], which features an additional, erroneous factor \((\lambda _{\max }(A) - \lambda _{\min }(A))/2\).

For the special case of the logarithm, the following result is obtained.

Corollary 4

Let \(A \in {\mathbb {R}}^{n \times n}\) be SPD with condition number \(\kappa (A)\), \(f\equiv \log \) and \(x \in {\mathbb {R}}^n \backslash \{0\}\). Then the error of the Lanczos method after m steps satisfies

$$\begin{aligned} | x^T \log (A) x - \Vert x\Vert _2^2 \cdot e_1^T \log (T_m) e_1 | \le {c_A} \Vert x\Vert _2^2\left( \frac{\sqrt{\kappa (A)+1}-1}{\sqrt{\kappa (A)+1}+1} \right) ^{2m}, \end{aligned}$$

where \(c_A := 2 (\sqrt{\kappa (A)+1}+1) \log (2\kappa (A))\).

Proof

The proof consists of applying Corollary 3 to a rescaled matrix. More specifically, we choose \(B := \lambda A\) with \(\lambda := 1/ ( 2\lambda _{\min } ) > 0\). The tridiagonal matrix returned by the Lanczos method with A replaced by B satisfies \(T^B_m = \lambda T_m\). Together with the identity \(\log (\lambda A) = \log \lambda I + \log (A)\), this implies

$$\begin{aligned} x^T \log (A) x - \Vert x\Vert _2^2\cdot e_1^T \log (T_m) e_1 = x^T \log (B) x - \Vert x\Vert _2^2 \cdot e_1^T \log (T_m^{B}) e_1. \end{aligned}$$

Note that the smallest/largest eigenvalues of B are given by 1/2 and \(\kappa (A)/2\), respectively. Applying Corollary 3 to B with \(\rho := \frac{\sqrt{\kappa (A)+1}+1}{\sqrt{\kappa (A) +1}-1}\) thus gives

$$\begin{aligned} | x^T \log (A) x - \Vert x\Vert _2^2\cdot e_1^T \log (T_m) e_1 | \le \Vert x \Vert _2^2 \cdot \frac{4 M_{\rho }}{1-\rho ^{-1}} \rho ^{-2m}. \end{aligned}$$

The constant \(M_{\rho }\) is the maximum absolute value of the logarithm on the ellipse with foci 1/2 and \(\kappa (A)/2\) that intersects the real axis at \(\alpha := \frac{1}{2\kappa (A)}\) and \(\beta := \frac{\kappa (A)^2 + \kappa (A) - 1}{2\kappa (A)}\). By Corollary 5, \(M_\rho = |\log (\alpha )| = \log (2 \kappa (A))\), where we used \(\alpha \le 1/\beta \le 1\). Noting that

$$\begin{aligned} \frac{4 M_{\rho }}{1-\rho ^{-1}} = 2 (\sqrt{\kappa (A)+1}+1) \log (2\kappa (A))\, {= c_A} \end{aligned}$$

concludes the proof. \(\square \)

4 Combined Bounds for Determinant Estimation

Combining randomized trace estimation with the Lanczos method, we obtain the following (stochastic) estimate for \(\log (\det (A))\):

$$\begin{aligned} \,\mathrm {est}^{\mathsf {G,R}}_{N,m} := \frac{1}{N} \sum _{i=1}^N \Vert X^{(i)} \Vert _2^2 \cdot e_1^T \log (T_m^{(i)}) e_1, \end{aligned}$$

where \(X^{(1)}, \ldots , X^{(N)}\) are independent Gaussian or Rademacher random vectors and \(T_m^{(i)}\) is the tridiagonal matrix obtained from the Lanczos method with starting vector \(X^{(i)} / \Vert X^{(i)} \Vert _2\). By combining the results obtained so far, we now derive new bounds on the number of samples and number of Lanczos steps needed to ensure an approximation error of at most \(\varepsilon \) (with high probability).
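A minimal sketch of this combined estimator is given below; it treats the quadratic form approximation as a black box (in practice the Lanczos routine sketched after Algorithm 1, with m steps), and the demonstration at the end uses an exact quadratic form only to keep the example self-contained.

```python
import numpy as np

def logdet_estimate(quadform, n, N, dist="rademacher", rng=None):
    """Stochastic estimate est_{N,m} of log(det(A)) = tr(log(A)).

    quadform : callable approximating x^T log(A) x for a probe vector x,
               e.g. m Lanczos steps returning ||x||^2 e_1^T log(T_m) e_1.
    """
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(N):
        if dist == "gaussian":
            x = rng.standard_normal(n)
        else:
            x = rng.integers(0, 2, size=n) * 2.0 - 1.0
        total += quadform(x)
    return total / N

if __name__ == "__main__":
    from scipy.linalg import logm
    rng = np.random.default_rng(2)
    n = 200
    C = rng.standard_normal((n, n))
    A = C @ C.T + n * np.eye(n)
    L = logm(A).real                      # exact log(A), for demonstration only
    est = logdet_estimate(lambda x: x @ (L @ x), n, N=100)
    print(est, np.linalg.slogdet(A)[1])
```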

4.1 Standard Gaussian Random Vectors

Theorem 4

Suppose that the following holds for N (number of Gaussian probe vectors) and m (number of Lanczos steps per probe vector):

(i)

    \(N \ge 16 \varepsilon ^{-2} (\rho _{\log } \Vert \log (A) \Vert _2^2 + \varepsilon \Vert \log (A) \Vert _2) \log \frac{4}{\delta }\), where \(\rho _{\log }\) denotes the stable rank of \(\log (A)\);

(ii)

    \(m \ge \frac{\sqrt{\kappa (A)+1}}{4} \log \big ( {4\varepsilon ^{-1} n^2(\sqrt{\kappa (A)+1}+1) \log (2\kappa (A))} \big )\).

If, additionally, \(n\ge 2\) and \(N \le \frac{\delta }{2} \exp \big ( \frac{n^2}{16} \big )\) then \( {\mathbb {P}}(|\,\mathrm {est}^{\mathsf {G}}_{N,m} - \log \det (A)| \ge \varepsilon ) \le \delta . \)

Proof

For a Gaussian vector X, the squared norm \(\Vert X\Vert _2^2\) is a Chi-squared random variable with n degrees of freedom. Therefore, by [29, Lemma 1] we have

$$\begin{aligned} {\mathbb {P}}(\Vert X \Vert _2^2 \ge n + 2\sqrt{nt} + 2t) \le \exp (-t) \end{aligned}$$

for every \(t > 0\). For \(t = \log \frac{2N}{\delta }\), the additional assumptions of the theorem imply

$$\begin{aligned} n + 2\sqrt{nt} + 2t \le n + 2\sqrt{n} \cdot \frac{n}{4} + 2\cdot \frac{n^2}{16} < n^2, \end{aligned}$$

and therefore \({\mathbb {P}}(\Vert X\Vert _2^2 \ge n^2) \le \frac{\delta }{2N}\). By the union bound, it holds that

$$\begin{aligned} {\mathbb {P}}\left( \text {exists } i \in \{1,\ldots ,N\} \text { s.t. }\Vert X^{(i)} \Vert _2^2 \ge n^2 \right) \le \frac{\delta }{2}. \end{aligned}$$
(14)

Corollary 4, together with condition (ii) and (14), implies that \( | \,\mathrm {est}^{\mathsf {G}}_{N,m} - \,\mathrm {tr}^{\mathsf {G}}_N(\log (A)) | \le \frac{\varepsilon }{2} \) holds with probability at least \(1 - \delta / 2\), where we also used that \( \log \left( \frac{ \sqrt{\kappa (A)+1}+ 1}{\sqrt{\kappa (A)+1} -1}\right) \ge \frac{2}{\sqrt{\kappa (A)+1}}\).

Applying Theorem 1 to the matrix \(\log (A)\), for which \(\Vert \log (A) \Vert _F^2 = \rho _{\log }\Vert \log (A)\Vert _2^2\), we find that \( |\,\mathrm {tr}^{\mathsf {G}}_N(\log (A)) - \log \det (A)| \le \frac{\varepsilon }{2} \) holds with probability at least \(1 - \delta / 2\). The proof is concluded by applying the triangle inequality. \(\square \)

4.2 Rademacher Random Vectors

Theorem 5

Suppose that the following holds for N (number of Rademacher probe vectors) and m (number of Lanczos steps per probe vector):

(i)

    \(N \ge 32 \varepsilon ^{-2} \left( \rho _{\mathrm {logd}} \Vert \log (A) - \,\mathrm {D}_{\log (A)} \Vert _2^2 + \frac{\varepsilon }{2} \Vert \log (A) - \,\mathrm {D}_{\log (A)} \Vert _2 \right) \log \frac{2}{\delta }\), where \(\rho _{\mathrm {logd}}\) denotes the stable rank of \(\log (A) - \,\mathrm {D}_{\log (A)}\) and \(\,\mathrm {D}_{\log (A)}\) is the diagonal matrix containing the diagonal entries of \(\log (A)\);

(ii)

    \(m \ge \frac{\sqrt{\kappa (A)+1}}{4} \log \big ( {4 \varepsilon ^{-1} n(\sqrt{\kappa (A) + 1} + 1)\log (2\kappa (A)) }\big )\).

Then \({\mathbb {P}}(|\,\mathrm {est}^{\mathsf {R}}_{N,m} - \log \det (A)| \ge \varepsilon ) \le \delta \).

Proof

Using Corollary 3 and the fact that Rademacher random vectors have norm \(\sqrt{n}\), the bound \( \big | \,\mathrm {est}^{\mathsf {R}}_{N,m} - \,\mathrm {tr}^{\mathsf {R}}_N(\log (A)) \big | \le \frac{\varepsilon }{2} \) holds if

$$\begin{aligned} m \ge \frac{1}{2} \log \big ( {4 \varepsilon ^{-1} n(\sqrt{\kappa (A)+1}+1) \log (2\kappa (A)) } \big ) \big / \log \left( \frac{ \sqrt{\kappa (A)+1}+ 1}{\sqrt{\kappa (A)+1} -1}\right) . \end{aligned}$$

Because of \(\log \Big ( \frac{ \sqrt{\kappa (A)+1}+ 1}{\sqrt{\kappa (A)+1} -1}\Big ) \ge \frac{2}{\sqrt{\kappa (A)+1}}\), condition (ii) ensures that this inequality is satisfied.

Applying Corollary 1 to \(\log (A)\) with \(\varepsilon \) replaced by \(\varepsilon /2\) immediately shows

$$\begin{aligned} |\,\mathrm {tr}^{\mathsf {R}}_N(\log (A)) - \log \det (A)| \le \frac{\varepsilon }{2} \end{aligned}$$
(15)

with probability at least \(1 - \delta \) if condition (i) is satisfied. The proof is concluded by applying the triangle inequality. \(\square \)

Comparison with an Existing Result. To compare Theorem 5 with an existing result from [40], it is helpful to first derive a simpler (but usually stronger) condition on N.

Lemma 6

The statement of Theorem 5 holds with condition (i) replaced by \(N \ge 8 \varepsilon ^{-2} \big ( n \log ^2 \kappa (A) + 2\varepsilon \log \kappa (A) \big ) \log \frac{2}{\delta }.\)

Proof

We set \(B := \lambda A\) with \(\lambda := 1/\sqrt{\lambda _{\min }(A) \lambda _{\max }(A)}\) and note that

$$\begin{aligned} \,\mathrm {tr}^{\mathsf {R}}_N(\log (A)) - \log \det (A) = \,\mathrm {tr}^{\mathsf {R}}_N(\log (\lambda A)) - \log \det (\lambda A). \end{aligned}$$

Using \(\lambda _{\max }(B) = \sqrt{\kappa (A)}\), \(\lambda _{\min }(B) = 1/\sqrt{\kappa (A)}\), and \(\kappa (B) = \kappa (A)\), we obtain

$$\begin{aligned} \Vert \log (B) - \,\mathrm {D}_{\log (B)} \Vert _2&\le 2 \Vert \log (B) \Vert _2 = \log \kappa (A);\\ \Vert \log (B) - \,\mathrm {D}_{\log ({B})} \Vert _F^2&\le \Vert \log (B) \Vert _F^2 = \rho (\log (B)) \frac{ \log ^2\kappa (A)}{4} \le \frac{n}{4} \log ^2\kappa (A). \end{aligned}$$

An application of Corollary 1 to \(\log (B)\) therefore yields (15) with probability at least \(1-\delta \) for \(N \ge 8 \varepsilon ^{-2} \big ( n \log ^2 \kappa (A) + 2 \varepsilon \log \kappa (A) \big ) \log \frac{2}{\delta }\).\(\square \)

Correcting for the two minor errata explained above, the result from [40, Corollary 4.5] states that \({\mathbb {P}}(|\,\mathrm {est}^{\mathsf {R}}_{N,m} - \,\mathrm {tr}(\log (A))| \ge \varepsilon ) \le \delta \) holds if

$$\begin{aligned} N \ge 24 \varepsilon ^{-2} n^2 \left( \log (1 + \kappa (A)) \right) ^2 \log \frac{2}{\delta } \end{aligned}$$
(16)

and

$$\begin{aligned} m \ge \frac{\sqrt{3\kappa (A)}}{4} \log \big ( 20 \varepsilon ^{-1} n \big ( \sqrt{2\kappa (A)+1} + 1\big )\log (2\kappa (A)+2) \big ). \end{aligned}$$
(17)

Compared to (16), Lemma 6 reduces the explicit dependence on the matrix size from \(n^2\) to n, while the dependence of the bounds on \(\kappa (A)\) is comparable. Let us stress that even a dependence on n does not compare favorably to simply computing the diagonal elements, but the bound from condition (i) of Theorem 5 can often be expected to be significantly better than the simplified bound of Lemma 6. Below we describe a situation in which the former only depends logarithmically on n. Condition (ii) of Theorem 5 improves (17) clearly but less drastically, roughly by a factor \(\sqrt{3}\).

Implications of Low Stable Rank. Let us consider a family of matrices \(\{ A_n \}\) of increasing dimension, a fixed failure probability \(\delta \), and a fixed accuracy \(\varepsilon \); the number of probe vectors required to get \({\mathbb {P}} (|\,\mathrm {tr}^{\mathsf {G,R}}_N(\log (A_n)) - \,\mathrm {tr}(\log (A_n))| \ge \varepsilon ) \le \delta \) is \(O(\rho _{n} \Vert \log (A_n)\Vert _2^2)\), where \(\rho _n\) is the stable rank of \(\log (A_n)\). In certain applications, including regularized kernel matrices (see, e.g., [12, 20]), the stable rank grows slowly when the matrix size increases. For such situations, our bounds lead to favorable implications. To illustrate this, let us consider matrices \(A_n := I + B_n\), where the eigenvalues satisfy \(\lambda _i(B_n) \le nC\alpha ^i\) for some constants \(C>0\) and \(0< \alpha < 1\), for all \(i\le n\), such as in the discretization of a radial basis function kernel on a fixed domain [20]. In this case, \(\rho _n = O(\log n)\). As a second example, if \(B_n\) comes from a discretization of a Matérn kernel on a regular grid in a fixed domain, its eigenvalues satisfy \(\lambda _i(B_n) \le nCi^{-\beta }\) for some constants \(C>0\) and \(\beta >1\), for all \(i \le n\) [12]; the stable rank of \(\log (A_n) = \log (I + B_n)\) is bounded by \(\rho _{n} = O(n^{1/\beta })\).

To apply Theorems 4 and 5 one also needs to take into account that, for both our examples, \(\Vert \log (A_n)\Vert _2\) and \(\kappa (A_n)\) grow proportionally to \(\log (n)\) and n, respectively. Finally, note that in practice one would consider \(A_n = \sigma I + B_n\) with the regularization parameter \(\sigma \) chosen adaptively; see, e.g., [11].

Fig. 2  Estimation of \(\,\mathrm {tr}(A^3)\) with Gaussian and Rademacher vectors for the matrix from Example 1. Error bounds from Theorem 1, Corollary 1, and [3] for failure probability \(\delta = 0.05\) compared with the observed error

Fig. 3  Number of samples needed to attain error \(\varepsilon = \frac{1}{10} \,\mathrm {tr}(A^3)\) with failure probability \(5\%\) for Example 1. Empirical failure probability vs. bounds from Theorem 1 and Corollary 1

5 Numerical Experiments

In this section, we report on a number of numerical experiments illustrating the bounds obtained in this work. All numerical experiments have been performed in Matlab, version 9.9 (R2020b).

Example 1

To compare the estimates from Theorem 1 and Corollary 1 with the convergence of randomized trace estimation using Gaussian and Rademacher vectors, we use an example from [3, 32]. The number of triangles in an undirected graph is equal to \(\frac{1}{6} \,\mathrm {tr}(A^3)\), where A is the (usually indefinite) adjacency matrix. Note that the quadratic forms \(X^T A^3 X\) can be evaluated exactly using two matrix-vector multiplications. We report results for an arXiv collaboration network with \(n = 5\,242\) nodes and \(48\,260\) triangles.

We estimate \(\,\mathrm {tr}(A^3)\) using \(N = 2, 2^2, 2^3, \ldots , 2^{11}\) samples. For each value of N we perform 1000 experiments and discard the 5% worst approximations in order to estimate an error bound that holds with probability \(95\%\). The obtained results are represented by the shaded regions in Fig. 2 and match our bounds fairly well, especially for Gaussian vectors.
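The evaluation of \(X^T A^3 X\) with two products by A, and the resulting triangle-count estimate, can be sketched as follows (the helper names are ours; A would be the sparse adjacency matrix of the network).

```python
import numpy as np

def cubic_quadform(A_matvec, x):
    """x^T A^3 x = (A x)^T A (A x) for symmetric A, using two products by A."""
    y = A_matvec(x)
    return y @ A_matvec(y)

def estimate_triangles(A_matvec, n, N, rng=None):
    """Estimate the triangle count tr(A^3)/6 with Rademacher probe vectors."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(N):
        x = rng.integers(0, 2, size=n) * 2.0 - 1.0
        total += cubic_quadform(A_matvec, x)
    return total / (6.0 * N)
```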

Figure 3 shows the empirical failure probability \({\mathbb {P}}(|\,\mathrm {tr}^{\mathsf {G,R}}_N(A^3) - \,\mathrm {tr}(A^3)| \ge \varepsilon )\) with \(\varepsilon = \frac{1}{10} \,\mathrm {tr}(A^3)\) using 1000 experiments for \(N = 2, 2^2, 2^3, \ldots , 2^{11}\) (blue and red lines). The vertical purple and yellow lines are the estimated number of samples needed to achieve failure probability \(\delta = 0.05\) from Theorem 1 and Corollary 1, respectively.

Table 1 Summary of the matrices used for log determinant experiments

Example 2

To compare the results of Theorems 4 and 5 with the number of sample vectors N and Lanczos steps m (per sample) required to reach a fixed accuracy, we consider the matrices listed in Table 1. The matrix thermomec_TC is contained in the University of Florida sparse matrix collection [14] and has been considered, for instance, in [10, 18, 40]. The matrix lowrank is defined in [30, 37] as

$$\begin{aligned} A = \sum _{j=1}^{40} \frac{10}{j^2} x_j x_j^T + \sum _{j=41}^{300} \frac{1}{j^2} x_j x_j^T, \end{aligned}$$

where each \(x_j\) is a sparse vector of length 20,000 with approximately \(2.5\%\) uniformly distributed nonzero entries, generated with the Matlab command sprand. The matrix precip is a two-dimensional Gaussian kernel matrix with length parameter \(\gamma = 64\) and regularization parameter \(\lambda = 0.008\) taken from [32], involving precipitation data from Slovakia [33]. As the matrices thermomec_TC and lowrank are too large for \(\log (A)\) to be computed explicitly, the quantities \(\Vert \log (A)\Vert _F\) and \(\Vert \log (A) - \,\mathrm {D}_{\log (A)}\Vert _F\) are approximated by randomized trace estimation combined with the Lanczos method to estimate the diagonal elements of \(\log (A)\).
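For reference, a matrix with the structure of lowrank can be generated along the following lines; this is our Python analogue of the construction (with scipy.sparse.random in place of Matlab's sprand), and it keeps A in factored form so that only matrix-vector products are ever formed.

```python
import numpy as np
import scipy.sparse as sp

n, k = 20_000, 300
# Sparse factors x_j with roughly 2.5% nonzero entries each.
X = sp.random(n, k, density=0.025, format="csc", random_state=3)
d = np.array([10.0 / j**2 if j <= 40 else 1.0 / j**2 for j in range(1, k + 1)])

def lowrank_matvec(v):
    """Product with A = sum_j d_j x_j x_j^T, kept in factored form X diag(d) X^T."""
    return X @ (d * (X.T @ v))
```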

For quadratic forms involving the logarithm, there is a relatively inexpensive way to obtain an upper bound on the error of the Lanczos method. As discussed in [5], Gauss quadrature always yields an upper bound for \(x^T \log (A)x\), while Gauss-Lobatto quadrature always yields a lower bound. We fix \(\delta = 0.1\) and for several values of \(\varepsilon \) we investigate how many samples and Lanczos iterations are needed in practice. When approximating quadratic forms while aiming at accuracy \(\varepsilon \), we stop the Lanczos method when the difference between upper and lower bound is less than \(\varepsilon /2\). Starting from \(N = 1\), we compute the empirical failure probability \({\mathbb {P}}(|\mathrm {est}_{N,m} - \log \det (A)| \ge \varepsilon )\); if this probability is larger than \(\delta \), we double the number of samples N and repeat.

Fig. 4  Results for matrix thermomec_TC from [14]

Fig. 5  Results for matrix lowrank from [30]

Fig. 6  Results for matrix precip from [32]

The results for the three matrices from Table 1 are reported in Figs. 4, 5, and 6. The left plots show, for the considered values of \(\varepsilon \) (which have been normalized by dividing them by the true \(|\log \det (A)|\)), the number of samples required to attain \(90\%\) success probability over 30 runs of the algorithm, versus the number of samples given by Theorems 4 and 5. The plots on the right show, for the same (normalized) values of \(\varepsilon \), the average number of Lanczos steps required to reach accuracy \(\varepsilon /2\) versus the number of Lanczos steps predicted by Theorems 4 and 5.

For thermomec_TC, the diagonal of \(\log (A)\) is large relative to the rest of the matrix: \(\Vert \log (A)- \,\mathrm {D}_{\log (A)}\Vert _F/\Vert \log (A)\Vert _F \approx 0.07\). Therefore, our bounds predict that Rademacher vectors perform much better than Gaussian vectors; this is indeed confirmed by Fig. 4. The matrix A is well-conditioned and, hence, the bounds correctly predict that the Lanczos method only needs relatively few iterations to attain good accuracy.

For lowrank, Fig. 5 shows that Rademacher and Gaussian vectors perform similarly. Although the condition number of A is \(\kappa (A) \approx 1560\), the eigenvalues have a strong decay, and hence the adaptivity of the Lanczos method lets it perform much better than predicted by our bounds; see, e.g., [23] for a discussion.

For precip, the ratio \(\Vert \log (A)- \,\mathrm {D}_{\log (A)}\Vert _F/\Vert \log (A)\Vert _F \approx 0.44\) is reflected in Fig. 6, showing that Rademacher vectors attain somewhat better accuracy. The condition number of A is high and there is no strong decay or gaps in the singular values; a relatively large number of Lanczos steps is necessary to obtain the desired accuracy when approximating the quadratic forms.