1 Introduction

We are concerned with the computation of the trace of a matrix function, that is:

$$ \mathcal{T}({\Omega};f) = \text{trace}{(f({\Omega}))}, $$
(1)

where \({\Omega } \in {\mathbb {R}}^{n \times n}\) is a symmetric matrix and f(x) is a real function of sufficient regularity, yet not trivially simple. We are interested in applications where the large size of the matrix Ω prevents the use of full matrix operations, so that only matrix-vector products are available. A paradigmatic example is obtained by choosing \(f(x) = -x \log (x)\) and Ω a positive semidefinite matrix, obtained as a (possibly very large) truncation of a quantum-mechanical operator. In this case \(\mathcal {T}({\Omega };f)\) becomes the von Neumann entropy, a fundamental physical quantity. The same function f(x) is also of relevance to information theory and dynamical systems [2]. A more recent and equally important application has arisen in relation to graphs, where Ω is the normalized graph Laplacian operator. The von Neumann entropy of graphs was introduced in [3] and its importance as a statistical indicator in random graphs has been demonstrated in [4, 5]. From a numerical point of view, its computation is notably challenging; see, e.g., [6] and references therein, and for a broader perspective on matrix functions see [7,8,9].

While a vast literature has dealt with the problem of approximating the trace of a matrix function (too large to review here, but see the early works of Claude Brezinski [10, 11]), in this paper we follow an approach originally developed in the physics literature [12,13,14] and later expounded by Golub and Meurant [15]. In this approach matrix elements of f(Ω) are reduced to Stieltjes integrals of f(x) with respect to suitable positive measures, as we discuss below. Evaluation of these integrals (at times also providing upper and lower bounds, when f(x) is a completely monotonic function) can be efficiently performed by Gaussian quadrature, which in turn can be profitably derived from the Jacobi matrix of the measure, computed via the Lanczos algorithm. The trace of the matrix function f(Ω) can finally be obtained by summation over all diagonal matrix elements in the canonical basis, cf. Section 2.

It is an interesting generalization to extend the formalism beyond the evaluation of matrix elements. The quantity \(\mathcal {T}({\Omega };f)\) can be computed as the integral

$$ \text{trace}{(f({\Omega}))} = {\int}_{0}^{\infty} f(x) d\nu_{\Omega}(x), $$
(2)

in which νΩ is a fundamental spectral measure associated with a compact operator, the counting measure of the eigenvalues,

$$ \nu_{\Omega} = \sum\limits_{j=1}^{n} \delta_{\lambda_{j}}, $$
(3)

where λj, with j = 1,…,n, are the eigenvalues of Ω and \(\delta _{\lambda _{j}}\) is the atomic measure of unit weight at position λj. This measure is of crucial importance for many physical quantities, from classical and quantum partition functions in statistical mechanics [12] to random matrix theory [16]. In this context, when normalized to unity, it is frequently called the density of states (DOS). It plays a prominent role also in mathematics, notably in constructive approximation: consider a positive Borel measure supported on a compact subset K of the real line, and let \(\lambda ^{(n)}_{j}\) be the zeros of the orthogonal polynomials of degree n associated with such measure. Remarkable properties are enjoyed by this measure if the normalized counting measure of the zeros has a weak limit, which is the equilibrium measure of the support K in logarithmic potential theory [17].

Numerical algorithms to deal with νΩ are therefore of paramount importance. In [18] it was shown that it is possible to adapt the Lanczos algorithm to construct a sequence of orthogonal matrix polynomials, whose associated Jacobi matrix exactly corresponds to the measure νΩ, and hence the trace in (1) can be computed by direct integration of (2), without passing via matrix elements. The same procedure was later discussed in [19] (also see the references therein), where it was termed the global Lanczos method, a nomenclature that we follow in this work, as opposed to the scalar Lanczos method for matrix elements. The global approach was also employed in a related context in [20, 21]. Formal equivalence of the scalar and global technique will be apparent in our presentation: Claude Brezinski, to whom this paper is dedicated, recognized this equivalence early on [22, 23].

When applied to estimate the trace of a large matrix, these techniques are computationally intensive, even though they only require matrix-vector products, a fact that limits their applicability in the case of huge problems. To overcome this limitation a stochastic approach is advisable, to provide an estimate of the desired quantity \(\mathcal {T}({\Omega };f)\). A technique combining random sampling with the Stieltjes approach for matrix elements was originally developed by Bai, Fahey, and Golub in [1]. The stochastic approach was extended to estimate the Jacobi matrix of the DOS measure in [18]. A further extension of [1] for the computation of the trace of a matrix function in the case of rank-one vectors is proposed in [24]. Moreover, [25] also deals with the computation in the case of indefinite matrices.

In this paper, we apply and extend the stochastic approach in [1] to a block version that collects 1 ≤ k ≤ n random vectors in an n × k matrix, similar to what is done in the global Lanczos method in [19]. The new block method reduces to what was proposed in [1] and in [18] in the cases k = 1 and k = n, respectively. We prove that the mean of the bounds obtained by applying our new block version converges to trace(f(Ω)). Moreover, we show that the average of k approximations obtained by the scalar stochastic method differs from what is obtained by the block stochastic algorithm applied to the same random vectors. Nevertheless, the results obtained by the two methods rapidly converge to each other already for small values of k and small Jacobi matrices. From a theoretical point of view, the two methods compute the same number of matrix-vector products, but the block version can take advantage of modern computer architectures, as global Lanczos does with respect to scalar Lanczos.

While the proposed block stochastic method can be applied to solve the problem (2) for a generic function f, in the numerical experiments we focus on the von Neumann entropy of graphs. Several tests on large real-world networks show that the proposed block stochastic method attains good accuracy at a relatively low CPU time.

This paper is organized as follows: In Section 2 we review the relations that connect the trace of a matrix function to Stieltjes integrals with respect to positive Borel measures and provide bounds by Gaussian and Gauss-Radau formulae. The Lanczos technique is employed in Section 3 to compute the Jacobi matrices of these measures. Our presentation aims to show the formal equivalence of the so-called scalar and global Lanczos technique, the latter described in Section 3.1. In Section 4 we review the Monte Carlo approach in [1], for which we prove theoretically a useful convergence property. In Section 5, following the same formal equivalence that links scalar and global Lanczos, the Monte Carlo approach is generalized to random “vectors” in \({\mathbb {R}}^{n\times k}\), also proving its convergence. The performance of this method is compared in Section 6 with the other reviewed algorithms on the computation of the von Neumann entropy of a sample of complex networks. Section 7 contains concluding remarks.

2 The trace of a matrix function as a quadratic functional

In this section, mainly to fix the notation, we write the trace of a matrix function as a quadratic functional, acting in spaces of vectors or matrices, in a similar fashion. These functionals are then reduced to Stieltjes integrals with respect to corresponding positive Borel measures. The standard theory is applied in Section 2.1 to bound these integrals by Gaussian and Gauss-Radau formulae.

The trace in (1) can be evaluated as a sum of quadratic forms:

$$ \text{trace}{(f({\Omega}))} = \sum\limits_{i=1}^{n} {\mathcal{Q}}(\mathbf{e}_{i};f). $$
(4)

In this equation, ei is the i-th vector in the canonical basis of \({\mathbb {R}}^{n}\) and \({\mathcal {Q}}(\mathbf {u};f)\) is

$$ {\mathcal{Q}}(\mathbf{u};f) = \mathbf{u}^{T}f({\Omega})\mathbf{u} = \langle\mathbf{u},f({\Omega})\mathbf{u}\rangle, $$
(5)

where \(\mathbf {u} \in {\mathbb {R}}^{n}\) and 〈⋅,⋅〉 indicates the scalar product in \({\mathbb {R}}^{n}\).

By gathering the terms in the summation (4) in g groups of k elements (we suppose for simplicity of notation that n = gk, but the procedure can be generalized when k does not divide n, by changing the cardinality of one or more of such groups) one sees that

$$ \text{trace}{(f({\Omega}))} = \sum\limits_{m=1}^{g} {\mathcal{Q}}(\mathbf{E}_{m};f), $$
(6)

where Em are the matrices composed of k columns of the identity matrix:

$$ \mathbf{E}_{m}=\left[\mathbf{e}_{(m-1)k+1},\ldots,\mathbf{e}_{mk}\right], \qquad m=1,\ldots,g, $$
(7)

and accordingly the quadratic form acts on “vectors” U in \({\mathbb {R}}^{n\times k}\) as

$$ {\mathcal{Q}}(\mathbf{U};f) = \text{trace}{(\mathbf{U}^{T} f({\Omega})\mathbf{U})}. $$
(8)

Note that we use the same symbol \({\mathcal {Q}}\) in (4), (6), and (8), the difference being revealed by the argument of the quadratic form, in lower case in (4) or upper case in (6) and (8). Clearly, the quadratic forms depend on Ω. Not to overburden the notation, this dependence will be left implicit, as well as that of the related measures. Similarly, we use the same symbol 〈⋅,⋅〉 for the scalar product in \({\mathbb {R}}^{n}\), (5), and the scalar product in the space of n × k real matrices, whose associated norm is the Frobenius norm:

$$ \langle \mathbf{U}, \mathbf{V} \rangle = \text{trace}{(\mathbf{U}^{T} \mathbf{V})}. $$
(9)

This usage cannot lead to confusion. In fact, it is easily seen that (5) is a special case of (9) for k = 1, where U reduces to u and V = f(Ω)u. Moreover, letting k = n implies that E1 is the identity matrix, so that \({\mathcal {Q}}(\mathbf {I};f)=\text {trace}{(f({\Omega }))}\) and (2) can also be fitted into this framework. Throughout the paper, we shall move back and forth between these two limiting cases.
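As a quick sanity check of these identities, the following minimal sketch (in Python/NumPy; the names and the test matrix are purely illustrative, and the paper's own experiments use Matlab) verifies numerically that (4) and (6)–(8) reproduce trace(f(Ω)) on a small symmetric matrix.

```python
import numpy as np

# Minimal check of identities (4) and (6)-(8) on a small symmetric matrix.
rng = np.random.default_rng(0)
n, k = 12, 3                                   # n = g*k with g = 4 groups
A = rng.standard_normal((n, n))
Omega = (A + A.T) / 2                          # symmetric test matrix

f = np.exp                                     # any sufficiently regular f
lam, Q = np.linalg.eigh(Omega)
fOmega = Q @ np.diag(f(lam)) @ Q.T             # f(Omega) via the spectral factorization (10)

trace_ref = np.trace(fOmega)

# Equation (4): sum of scalar quadratic forms over the canonical basis
scalar_sum = sum(fOmega[i, i] for i in range(n))

# Equations (6)-(8): sum over the blocks E_m of k columns of the identity
I = np.eye(n)
block_sum = sum(np.trace(I[:, m*k:(m+1)*k].T @ fOmega @ I[:, m*k:(m+1)*k])
                for m in range(n // k))

print(trace_ref, scalar_sum, block_sum)        # the three values agree to rounding error
```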

Let us therefore focus on the quadratic functional (8), starting from the case k = 1.

2.1 Gaussian quadrature

Gauss-type quadrature rules [12,13,14,15, 26] can be employed to estimate the k = 1 case introduced above, that is, the quadratic form (5). An overview is presented in [15]. This technique also provides upper and lower bounds to \({\mathcal {Q}}(\mathbf {u};f)\). Although it is by now quite standard, we briefly review it, since it is at the core of the subsequent developments.

The quadratic form in (5) takes a particularly transparent form when one employs the spectral factorization

$$ {\Omega}=Q{\Lambda} Q^{T}, $$
(10)

in which the columns of the orthogonal matrix \(Q=[\mathbf {q}_{1},\mathbf {q}_{2},\dots ,\mathbf {q}_{n}]\in {\mathbb {R}}^{n\times n}\) are the orthonormalized eigenvectors of Ω: Ωqj = λjqj, with j = 1,…,n, and Λ is the diagonal matrix \({\Lambda } = \text {diag}\left [\lambda _{1},\lambda _{2},\dots ,\lambda _{n}\right ]\in {\mathbb {R}}^{n\times n}\). Eigenvalues and eigenvectors are conveniently ordered according to \(\lambda_{1} \le \lambda_{2} \le \dots \le \lambda_{n}\). In so doing, one obtains

$$ {\mathcal{Q}}(\mathbf{u};f) = \mathbf{u}^{T} Qf({\Lambda})Q^{T} \mathbf{u} = \sum\limits_{j=1}^{n} \langle \mathbf{u}, \mathbf{q}_{j}\rangle^{2} f(\lambda_{j}), $$
(11)

which can be understood as a Stieltjes integral:

$$ {\mathcal{Q}}(\mathbf{u};f) = {\int}_{0}^{\infty} \! f(x) d\mu_{\mathbf{u}}(x). $$
(12)

In the above, μu is a piece-wise constant distribution function with jumps of size \(\langle \mathbf{u},\mathbf{q}_{j}\rangle^{2}\) at the eigenvalues λj of Ω. Equivalently, μu is a positive discrete measure composed of n atoms:

$$ \mu_{\mathbf{u}} = \sum\limits_{j=1}^{n} \langle \mathbf{u}, \mathbf{q}_{j}\rangle^{2} \delta_{\lambda_{j}}. $$
(13)

We can take u as a unit norm vector so that μu is a probability measure. Therefore, the integral (12) can be approximated by the \(\ell\)-point Gaussian quadrature rule

$$ {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{u}},f) =\sum\limits_{i=1}^{\ell} w_{i}^{(\ell)} f(z_{i}^{(\ell)}). $$
(14)

Equation (14) is itself an integral of f with respect to a discrete measure. As is well known, this measure can be derived from \(J^{(\ell)}(\mu_{\mathbf{u}})\), the truncation of rank \(\ell\) of the Jacobi matrix of μu, which can be formally written as

$$ J^{(\ell)}(\mu_{\mathbf{u}}) = \begin{bmatrix} a_{0} & b_{1} & & & \\ b_{1} & a_{1} & b_{2} & & \\ & b_{2} & {\ddots} & {\ddots} & \\ & & {\ddots} & {\ddots} & b_{\ell-1}\\ & & & b_{\ell-1} & a_{\ell-1} \end{bmatrix} \in{{\mathbb{R}}}^{\ell\times \ell}. $$

In fact, the nodes \(z_{i}^{(\ell )}\) are the eigenvalues of \(J^{(\ell)}(\mu_{\mathbf{u}})\) and the weights \(w_{i}^{(\ell )}\) are the squares of the first components of the associated normalized eigenvectors [27], quantities that can be profitably computed by the Golub-Welsch method [28].

Note that the quantity (14) can also be computed starting from the Jacobi matrix \(J^{(\ell)}(\mu_{\mathbf{u}})\), that is:

$$ {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{u}},f) = \mathbf{e}_{1}^{T} f(J^{(\ell)}(\mu_{\mathbf{u}})) \mathbf{e}_{1}. $$
(15)
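For illustration, the following sketch (a minimal NumPy version, not the code used for the experiments) evaluates (14) by the Golub-Welsch route and checks it against the expression (15), computed here with the matrix exponential as a sample f.

```python
import numpy as np
from scipy.linalg import expm

def gauss_rule(a, b, f):
    """Gauss estimate (14) from the recurrence coefficients of the Jacobi matrix.

    a : diagonal entries a_0, ..., a_{l-1}
    b : off-diagonal entries b_1, ..., b_{l-1}
    """
    J = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
    nodes, V = np.linalg.eigh(J)         # Golub-Welsch: nodes = eigenvalues of J^(l)
    weights = V[0, :] ** 2               # weights = squared first eigenvector components
    return J, np.sum(weights * f(nodes))

# toy recurrence coefficients, for illustration only
a = np.array([0.5, 0.4, 0.6])
b = np.array([0.2, 0.1])
J, gauss = gauss_rule(a, b, np.exp)

e1 = np.zeros(len(a)); e1[0] = 1.0
print(gauss, e1 @ expm(J) @ e1)          # formula (15): e_1^T f(J) e_1, same value
```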

3 The Lanczos method

The required Jacobi matrix can be obtained by \(\ell\) steps of the Lanczos method reproduced here in Algorithm 1, applied to the matrix Ω with initial vector u.

Suppose that the function f(x) is completely monotonic, i.e., when z belongs to the convex hull of the spectrum of Ω, the \(\ell\)-th derivative of f is such that \((-1)^{\ell} f^{(\ell)}(z) \geq 0\) (at least in a range of values of \(\ell\)). Then, analysis of the remainder formula for Gaussian integration reveals that \({\mathcal {G}}^{(\ell )}(\mu _{\mathbf {u}},f)\) provides a lower bound to the integral (12). It has been proved that, in exact arithmetic, the bound gets tighter as \(\ell\) increases; see [15].

Under the same conditions, one also proves that the (\(\ell+1\))-point Gauss-Radau quadrature rule \({\mathcal {R}}^{(\ell +1)}(\hat \mu _{\mathbf {u}},f)\) yields tightening upper bounds. This quadrature is constructed with a fixed node at z0 (z0 can be any real value to the left of the spectrum of Ω). The rule can be computed in the same way as in (15), from the truncated Jacobi matrix of the positive measure \(d \hat {\mu }_{\boldsymbol {u}}(s) = (s-z_{0}) d\mu _{\boldsymbol {u}}(s)\), which can be easily obtained as follows:

$$ J^{(\ell+1)}(\hat{\mu}_{\mathbf{u}}) = \begin{bmatrix} J^{(\ell)}(\mu_{\mathbf{u}}) & b_{\ell} \mathbf{e}_{\ell} \\ b_{\ell} \mathbf{e}^{T}_{\ell} & \hat{a}_{\ell} \end{bmatrix} \in{{\mathbb{R}}}^{(\ell+1)\times (\ell+1)}, $$

In this equation \(\hat {a}_{\ell } = z_{0}-b_{\ell }^{2}\, \pi _{\ell -1}(z_{0})/ \pi _{\ell }(z_{0})\), and \(\pi_{\ell}\) is the monic orthogonal polynomial of degree \(\ell\) of μu. Similarly to the computation of \({\mathcal {G}}^{(\ell )}(\mu _{\mathbf {u}},f)\) in (14), the quantity \({\mathcal {R}}^{(\ell +1)}(\hat \mu _{\mathbf {u}},f)\) can also be computed starting from the Jacobi matrix \(J^{(\ell +1)}(\hat {\mu }_{\mathbf {u}})\), that is:

$$ {\mathcal{R}}^{(\ell+1)}(\hat\mu_{\mathbf{u}},f) = \mathbf{e}_{1}^{T} f(J^{(\ell+1)}(\hat\mu_{\mathbf{u}})) \mathbf{e}_{1}. $$
(16)
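The modified entry \(\hat{a}_{\ell}\) need not be computed through the monic polynomials explicitly: the classical device of Golub and Meurant obtains it from the solution of a tridiagonal linear system. The sketch below (illustrative only, same assumptions as the snippets above) augments a given Jacobi matrix with a node prescribed at z0.

```python
import numpy as np

def radau_jacobi(J, b_l, z0):
    """Extend an l x l Jacobi matrix to the (l+1) x (l+1) Gauss-Radau matrix
    with a prescribed node z0 (which must lie outside the support of the measure).

    b_l is the next off-diagonal coefficient delivered by the Lanczos recursion.
    """
    l = J.shape[0]
    el = np.zeros(l); el[-1] = 1.0
    # Solve (J - z0 I) delta = b_l^2 e_l; the modified entry is a_hat = z0 + delta_l,
    # which coincides with the expression z0 - b_l^2 pi_{l-1}(z0)/pi_l(z0) given above.
    delta = np.linalg.solve(J - z0 * np.eye(l), (b_l ** 2) * el)
    Jr = np.zeros((l + 1, l + 1))
    Jr[:l, :l] = J
    Jr[l - 1, l] = Jr[l, l - 1] = b_l
    Jr[l, l] = z0 + delta[-1]
    return Jr       # e_1^T f(Jr) e_1 then gives the Gauss-Radau estimate (16)
```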

Combining the two bounds, when f is completely monotonic, the following sequence of inequalities holds for any \(\ell\):

$$ {\mathcal{G}}^{(\ell-1)}(\mu_{\mathbf{u}},f)<{\mathcal{G}}^{(\ell)}(\mu_{\mathbf{u}},f)< {\mathcal{Q}}(\mathbf{u};f) <{\mathcal{R}}^{(\ell+1)} (\hat \mu_{\mathbf{u}},f)<{\mathcal{R}}^{(\ell)} (\hat \mu_{\mathbf{u}},f); $$
(17)

see, e.g., [15, 27, 29] for details. In practice, it is sufficient that the j-th derivative of f is completely monotonic for some j, preferably small for computational purposes.

We assume that the Lanczos algorithm does not break down, that is, \(b_{\ell-1}\) in the matrix \(J^{(\ell)}(\mu_{\mathbf{u}})\) is larger than zero. When \(b_{\ell}\) is null, the spectrum of the Jacobi matrix is a subset of the spectrum of Ω and \({\mathcal {G}}^{(\ell )}(\mu _{\boldsymbol {u}},f) = {\mathcal {R}}^{(\ell +1)}(\hat \mu _{\boldsymbol {u}},f) = \boldsymbol {u}^{T} Qf({\Lambda })Q^{T} \boldsymbol {u}\). If \(b_{\ell}\) turns out to be negative, this means that numerical instabilities (typically, loss of normalization) have affected the computation.

Algorithm 1 The Scalar Lanczos algorithm
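Since Algorithm 1 appears only as a figure in the original, a minimal sketch of the underlying three-term recursion follows (Python, for illustration only; it assumes a unit-norm starting vector, no breakdown, and performs no reorthogonalization, unlike a production implementation).

```python
import numpy as np

def scalar_lanczos(Omega_mv, u, l):
    """Return the l x l Jacobi matrix J^(l)(mu_u) and the next coefficient b_l.

    Omega_mv : function computing Omega @ v (only matrix-vector products are needed)
    u        : starting vector of unit Euclidean norm
    l        : number of Lanczos steps
    """
    a = np.zeros(l)
    b = np.zeros(l)                        # b[j] holds b_{j+1}
    q_prev = np.zeros_like(u)
    q = u.copy()
    for j in range(l):
        w = Omega_mv(q)
        a[j] = q @ w
        w = w - a[j] * q - (b[j - 1] * q_prev if j > 0 else 0.0)
        b[j] = np.linalg.norm(w)           # breakdown b[j] = 0 is not handled here
        q_prev, q = q, w / b[j]
    J = np.diag(a) + np.diag(b[:l - 1], 1) + np.diag(b[:l - 1], -1)
    return J, b[l - 1]
```

Feeding J into a Gauss rule as in (15), and into its Gauss-Radau extension as in (16), then yields the lower and upper bounds of (17).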

Algorithm 2 The Global Lanczos algorithm

3.1 The global Lanczos method

The method of the previous section can be generalized to the case of the quadratic form (8), replacing the vectors u by matrices U in steps 1 to 8 (see Algorithm 2). The generalization is particularly instructive when k = n and U = I is the identity matrix. In this case, the Lanczos Algorithm 2 yields the sequence of matrix orthogonal polynomials pj(Ω) that satisfy the three-term recurrence relation

$$ {\Omega} p_{j} ({\Omega})= b_{j+1} p_{j+1}({\Omega}) + a_{j} p_{j}({\Omega}) + b_{j} p_{j-1}({\Omega}), $$
(18)

initialized by p0(Ω) = I and \(p_{-1}({\Omega}) = \mathbf{0}\), \(\mathbf {0} \in {\mathbb {R}}^{n \times k}\) being the null operator. It is easy to show (see, e.g., [18, Sect. 10, Lemma 10.1]) that the Jacobi matrix so produced corresponds to the measure μI which, in this case, is precisely the measure νΩ defined in (3), the counting measure of the eigenvalues. In the mathematical physics literature, this measure (when normalized to unity) is also known as the density of states of the operator Ω, to distinguish it from the local density of states, typically represented by (13).

Because of this formal equivalence, the considerations presented in the previous sections extend to this case: when f(x) is a completely monotonic function, Gauss and Gauss-Radau quadratures provide lower and upper bounds, respectively, to the desired quantity (2).

The case 1 < k < n, discussed in [19], is intermediate between the two extremes and can be described by the same theory. In fact, in this case, one can identify the measure μU as follows. Let \(\mathbf {U} = [\mathbf {u}_{1},\mathbf {u}_{2},\ldots ,\mathbf {u}_{k}] \in {\mathbb {R}}^{n \times k}\) be a matrix with normalized columns \(\mathbf {u}_{s} \in {\mathbb {R}}^{n}\), for s = 1 to k. Taking f(x) = x, we observe that

$$ {\mathcal{Q}}(\mathbf{U};f) = \langle \mathbf{U}, {\Omega} \mathbf{U} \rangle = \sum\limits_{j=1}^{n} \lambda_{j} \sum\limits_{s=1}^{k} \langle \mathbf{u}_{s} , \mathbf{q}_{j} \rangle^{2}, $$
(19)

which shows that the measure μU is

$$ \mu_{\mathbf{U}} = \sum\limits_{j=1}^{n} \mu_{\mathbf{U}_j} \delta_{\lambda_{j}}, $$
(20)

with the positive weights \(\mu _{\mathbf {U}_j}\) implicitly defined by (19). Quite in the same way as before, computing the Jacobi matrix \(J^{(\ell)}(\mu_{\mathbf{U}})\) via the Lanczos algorithm and proceeding with the Golub-Welsch diagonalization and finally the quadratures yields lower and upper bounds to \({\mathcal {Q}}(\boldsymbol {U};f)\) for a completely monotonic f(x). The synthetic scheme of this procedure is presented in Algorithm 2.
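A minimal sketch of the corresponding global recursion is given below (again illustrative and without reorthogonalization); it differs from the scalar version shown earlier only in that the iterates live in \({\mathbb {R}}^{n\times k}\) and the inner product is the trace inner product (9), with the Frobenius norm as associated norm.

```python
import numpy as np

def global_lanczos(Omega_mv, U, l):
    """Return the l x l Jacobi matrix J^(l)(mu_U) and the next coefficient b_l.

    Omega_mv : function computing Omega @ V for an n x k block V
    U        : starting block of unit Frobenius norm
    l        : number of global Lanczos steps
    """
    a = np.zeros(l)
    b = np.zeros(l)
    V_prev = np.zeros_like(U)
    V = U.copy()
    for j in range(l):
        W = Omega_mv(V)                          # one block of k matrix-vector products
        a[j] = np.trace(V.T @ W)                 # <V, Omega V> in the sense of (9)
        W = W - a[j] * V - (b[j - 1] * V_prev if j > 0 else 0.0)
        b[j] = np.linalg.norm(W, 'fro')
        V_prev, V = V, W / b[j]
    J = np.diag(a) + np.diag(b[:l - 1], 1) + np.diag(b[:l - 1], -1)
    return J, b[l - 1]
```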

Finally, choosing in sequence U = Em, m = 1,…,g, as in (7), one obtains lower and upper bounds to the desired quantity trace(f(Ω)), via (6).

4 Monte Carlo approach

Often, the size of the matrix Ω is so large that computer time limitations hinder the computation of the full sums (4) or (6). A stochastic approach is then advisable, to provide a probabilistic estimate of the desired quantity \(\mathcal {T}({\Omega };f)\). A first approach employs the following result from [30, 31], which the reader can easily prove by direct computation (also see (24) in Section 5 below).

Proposition 1 ([30, 31])

Let H be an n × n symmetric matrix. Let \(\mathbf {u}\in {\mathbb {R}}^{n}\) be a random vector whose entries are independent, identically distributed random variables of zero mean and unit variance. Then, the random variable 〈u,Hu〉 is an unbiased estimator of trace(H), while the variance of the estimator is \(2 {\sum }_{i \neq j} \langle \mathbf {e}_{i}, H \mathbf {e}_{j} \rangle ^{2}\).
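A direct numerical illustration of this proposition is the following sketch, which uses Rademacher (±1) entries; any zero-mean, unit-variance distribution would do.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 500                              # matrix size and number of random probes
A = rng.standard_normal((n, n))
H = (A + A.T) / 2                            # symmetric test matrix

samples = []
for _ in range(k):
    u = rng.choice([-1.0, 1.0], size=n)      # independent, zero-mean, unit-variance entries
    samples.append(u @ H @ u)                # unbiased estimator of trace(H)

print(np.trace(H), np.mean(samples))         # the sample mean approaches trace(H) as k grows
```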

To apply this proposition to the problem at hand, where H = f(Ω), the matrix function should in principle be evaluated first. To circumvent this problem, the authors of [1] combined stochastic sampling with the Stieltjes technique. This produces Algorithm 3, reproduced herein, where we use the symbol k to denote the cardinality of the random sample. This choice is dictated by the formal equivalence with the material in Section 3.1, as will be apparent in a moment.

Algorithm 3 The Monte Carlo approach introduced in [1]

In words, the algorithm evaluates upper and lower bounds for the k functionals \({\mathcal {Q}}(\mathbf {u}_{s};f)\) and averages these values to get an estimate of trace(f(Ω)). In fact, observe that the bounds \(L^{(\ell)}(\mathbf{u}_{s};f)\) and \(U^{(\ell+1)}(\mathbf{u}_{s};f)\) are random functions themselves, via the random vectors us. Consider the inequality (10) in reference [1] (adapted to our notation)

$$ \frac{1}{k} \sum\limits_{s=1}^{k} L^{(\ell)}(\mathbf{u}_{s};f) \leq \frac{1}{k} \sum\limits_{s=1}^{k} \langle \mathbf{u}_{s}, f({\Omega}) \mathbf{u}_{s} \rangle \leq \frac{1}{k} \sum\limits_{s=1}^{k} U^{(\ell+1)}(\mathbf{u}_{s};f) $$
(21)

and the related sentence from the same paper: "It is natural to expect that with suitable sample size, the mean of the computed bounds yields a good estimation of the quantity trace(f(Ω))." Yet, Proposition 1 only guarantees that the expectation of the central term in the above inequalities is equal to the desired quantity, so that the expectations of the first and third terms (the ones computed by the algorithm) are, respectively, smaller and larger than trace(f(Ω)). Clearly, the expectation of the random variable \(\frac {1}{2} [L^{(\ell )}(\boldsymbol {u}_{s};f) + U^{(\ell +1)}(\boldsymbol {u}_{s};f)]\) lies between these bounds, but it cannot be assumed to be equal to trace(f(Ω)). This leaves a theoretical gap that was implicitly acknowledged in the quoted sentence.

In practice, though, the statistical variation in the computation of the expectation of the indicator \(\frac {1}{2} [L^{(\ell )}(\mathbf {u}_{s};f) + U^{(\ell +1)}(\mathbf {u}_{s};f)]\) is much wider than its possible bias, so that numerical experiments in [1, 18] together with the use of statistical inequalities seemed to provide reliable estimates.

We are now in a position to show theoretically that the possible bias in the above indicator decreases in the limit of increasing Jacobi matrix size, actually in a wider context than originally considered in [1].

Proposition 2

Let Ω be an n × n symmetric matrix. Let \(\mathbf {u} \in {\mathbb {R}}^{n}\) be a random vector whose entries are independent, identically distributed random variables of zero mean and unit variance, and let f be a completely monotonic real function. Then, the expectations of the random variables \({\mathcal {G}}^{(\ell )}(\mu _{\boldsymbol {u}},f)\) and \({\mathcal {R}}^{(\ell +1)}(\hat \mu _{\boldsymbol {u}},f)\) converge to trace(f(Ω)) when \(\ell\) tends to infinity.

Proof

On the one hand, each quadrature in the proposition is a function of the random vector u in the probability space \({\mathbb {R}}^{n}\), which is drawn according to a generic Borel probability measure ρ(u). On the other hand, when the moment problem is determined, which is always the case for the spectral measure μu of a finite matrix Ω (the result extends to a much larger family of operators), for any fixed vector u the sequence \({\mathcal {G}}^{(\ell )}(\mu _{\mathbf {u}},f)\) converges monotonically (for a completely monotonic f) to \({\int \limits } f(x) d\mu _{\mathbf {u}}(x)\), which is equal to 〈u,f(Ω)u〉. Therefore, we can apply Beppo Levi’s theorem to prove that

$$ \begin{array}{@{}rcl@{}} &&\lim_{\ell \rightarrow \infty} \int d \rho(\mathbf{u}) {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{u}},f) = \int d \rho(\mathbf{u}) \lim_{\ell \rightarrow \infty} {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{u}},f) = \\ &&= \int d \rho(\mathbf{u}) \int f(x) d\mu_{\mathbf{u}}(x) = \int d \rho(\mathbf{u}) \langle \mathbf{u}, f({\Omega}) \mathbf{u} \rangle = \text{trace}{(f({\Omega}))}. \end{array} $$
(22)

The first term in the above equation is the limit of the expectation of the random variable \({\mathcal {G}}^{(\ell )}(\mu _{\mathbf {u}},f)\), which therefore converges to the expectation of the random variable 〈u,f(Ω)u〉; by Proposition 1, the latter is equal to the trace of f(Ω), see also (23) and (24) below. A similar reasoning applies to the upper bounds provided by the Gauss-Radau quadrature. □

It is evident that, in exact arithmetic, when Ω is a finite matrix the limit in the above proposition is certainly achieved at \(\ell = n\). Since \(\ell\) is typically smaller than n in applications, the monotonic decrease of the bias predicted theoretically is of practical relevance. As mentioned before, the argument can be extended to more general operators in Hilbert spaces of infinite dimension, like the multiplication operator by x in \(L^{2}_{\mu }\), when μ is a singular continuous measure [32], where the limit for \(\ell\) tending to infinity is essential.

Summarizing, the quantities \(\frac {1}{k} {\sum }_{s=1}^{k} L^{(\ell )}(\mathbf {u}_{s};f)\) and \(\frac {1}{k} {\sum }_{s=1}^{k} U^{(\ell +1)}(\mathbf {u}_{s};f)\) are stochastically distributed, with standard deviation inversely proportional to \(\sqrt k\), around a mean value which tends to the trace of f(Ω) as \(\ell\) grows. In the practical implementation the individual components of us are conveniently chosen as random variables taking the two values one and minus one with equal probability (the so-called \(\mathbb {Z}_{2}\) noise), divided by \(\sqrt {n}\) to achieve normalization.

When the size n of Ω is exceedingly large, affordable values of k imply that the above standard deviations are much larger than the difference between the two sample averages: it is then unnecessary to consider (and compute) both. By the central limit theorem, the distribution of these samples approaches a normal one for large k, yielding stricter probabilistic bounds than those obtained by Hoeffding’s inequality [18].

5 Stochastic evaluation of the Jacobi matrix of the eigenvalue counting measure

As described at the beginning of Section 3.1, Algorithm 2 with the identity matrix as starting “vector” yields the sequence of matrix orthogonal polynomials pj(Ω). It requires the computation of the scalar products \(\langle p_{j}({\Omega }),{\Omega }^{r} p_{j}({\Omega }) \rangle \), where r can take the values zero and one. Since the scalar product (9) is defined as the trace of an operator, it is computed by a summation involving all basis vectors ei, i = 1 to n, as in (4), or quite equivalently all matrices Em, \(m=1, \dots , g\), as in (6). Clearly, the computational complexity of this procedure grows as n times the complexity of the matrix-vector product, which is the most computationally demanding task and is linear in n in the case of sparse Ω as in complex network applications.

A stochastic implementation of the Lanczos algorithm for the Jacobi matrix of the counting measure that vastly reduces the computational complexity was described in [18, Sect. 10, Algorithm 5 and the remark just following]. The core of the algorithm coincides with steps 1 to 8 of the Global Lanczos algorithm proposed in [19]. In fact, the stochastic method starts by constructing k normalized random vectors us as in the previous section, but it assembles them in an n × k random matrix U = [u1,u2,⋯,uk]. The Global Lanczos Algorithm 2 is then applied to U. The difference between the scalar version (Algorithm 1) and the global version (Algorithm 2) can be appreciated by spelling out the steps in the latter. In fact, taking into account the structure of the recursion relation and of the scalar product, the sequence of operations can be described as in Algorithm 4.

Algorithm 4 Stochastic computation of trace(f(Ω)) via the Jacobi matrix of the eigenvalue counting measure

A few important remarks are in order. Firstly, observe the difference between Algorithms 3 and 4. While they take the same input and attempt to compute the same quantity, the procedure is different. In fact, Algorithm 3 computes the Jacobi matrices \(J^{(\ell )}(\mu _{\mathbf {u}_{s}})\) of the local density of states \(\mu _{\mathbf {u}_{s}}\) in (13) and successively employs Proposition 2. By contrast, Algorithm 4 aims at a stochastic computation of the Jacobi matrix of the global density of states, via the measure μU in (20). To put it even more clearly, while in the former algorithm each stochastic vector us yields its own Jacobi matrix \(J^{(\ell )}(\mu _{\mathbf {u}_{s}})\), in the latter the vectors us conspire to form a single Jacobi matrix \(J^{(\ell)}(\mu_{\mathbf{U}})\), as can be noted in lines 6 and 10 of Algorithm 4, which are outside the loop 7–9. The two procedures yield different results that rapidly converge to the same limit as \(\ell\) grows. Numerical experience (see Section 6) suggests that this convergence can be quite rapid, also profiting from the phenomenon of self-averaging [33] of large matrices. The arguments in Proposition 2 can be repeated verbatim in this case, leading to the following result.
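Since Algorithm 4 is reported only as a figure, an illustrative end-to-end sketch follows (our own Python reformulation under explicit assumptions, not the paper's Matlab code; it assumes the global_lanczos helper sketched in Section 3.1 is in scope). A Rademacher block is drawn, normalized in the Frobenius norm, fed to the global recursion, and the Gauss estimate from the resulting Jacobi matrix is rescaled to an estimate of the full trace.

```python
import numpy as np

def block_mc_trace(Omega_mv, n, k, l, f, rng):
    """Illustrative block Monte Carlo estimate of trace(f(Omega)).

    Assumes the global_lanczos helper sketched in Section 3.1 is available.
    The rescaling ||U||_F^2 / k turns the Gauss estimate of
    <U, f(Omega) U> / ||U||_F^2 into an estimate of (1/k) <U, f(Omega) U>,
    whose expectation is trace(f(Omega)) for unit-variance entries.
    """
    U = rng.choice([-1.0, 1.0], size=(n, k))          # Rademacher block
    fro2 = np.linalg.norm(U, 'fro') ** 2              # equals n*k for +-1 entries
    J, _ = global_lanczos(Omega_mv, U / np.sqrt(fro2), l)
    nodes, V = np.linalg.eigh(J)
    gauss = np.sum(V[0, :] ** 2 * f(nodes))           # e_1^T f(J) e_1, cf. (15)
    return (fro2 / k) * gauss
```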

Proposition 3

Let Ω be an n × n symmetric matrix. Let U be a random n × k matrix whose entries are independent, identically distributed random variables of zero mean and unit variance, and let f be a completely monotonic real function. Then, the expectations of the random variables \({\mathcal {G}}^{(\ell )}(\mu _{\boldsymbol {U}},f)\) and \({\mathcal {R}}^{(\ell +1)}(\hat \mu _{\boldsymbol {U}},f)\) converge to trace(f(Ω)) when \(\ell\) tends to infinity.

Proof

Since U is a random n × k matrix, (22) becomes

$$ \lim_{\ell \rightarrow \infty} \int d \rho(\mathbf{U}) {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{U}},f) = \int d \rho(\mathbf{U}) \lim_{\ell \rightarrow \infty} {\mathcal{G}}^{(\ell)}(\mu_{\mathbf{U}},f) = \int d \rho(\mathbf{U}) \langle \mathbf{U}, f({\Omega}) \mathbf{U} \rangle, $$
(23)

where ρ(U) is a probability measure over \({\mathbb {R}}^{n \times k}\), namely the product of the k probability measures ρ(u) over \({\mathbb {R}}^{n}\) employed in Proposition 2, and ultimately the product of n × k real probability measures ρ(x) of zero mean and unit variance. The right-hand side of (23) is precisely equal to trace(f(Ω)) because

$$ \begin{array}{@{}rcl@{}} \int d\rho(\mathbf{U}) \langle \mathbf{U}, {\Omega} \mathbf{U} \rangle & = & {\sum}_{j=1}^{n} \lambda_{j} {\sum}_{s=1}^{k} \int d \rho(\mathbf{u}_{s}) \langle \mathbf{u}_{s}, \mathbf{q}_{j} \rangle^{2} = \\ & = & {\sum}_{j=1}^{n} \lambda_{j} \int d \rho(x) x^{2} \sum\limits_{i=1}^{n} \langle \mathbf{e}_{i} , \mathbf{q}_{j} \rangle^{2} = \text{trace}{({\Omega})}. \end{array} $$
(24)

Here we have used that \({\int \limits } d \rho (x) x^{2} = 1\) is the variance of ρ(x) and that \({\sum }_{i=1}^{n} \langle \mathbf {e}_{i} , \mathbf {q}_{j} \rangle ^{2} = 1\) by normalization of the eigenvectors. A similar reasoning applies to the upper bounds provided by \({\mathcal {R}}^{(\ell +1)}(\hat \mu _{\mathbf {U}},f)\). □

6 Estimating the von Neumann entropy of graphs

As described in Section 1, the von Neumann entropy is obtained by choosing

$$ f(x)=-x\log(x), $$
(25)

where we define \(0\log (0)=0\) by convention [3]. Simple calculations show that \((-1)^{\ell} f^{(\ell)}(x) \leq 0\) for \(\ell \geq 2\), which implies that the previous theory can be applied for Jacobi matrices of rank larger than or equal to two, switching the roles of the Gauss and Gauss-Radau quadratures because of the reversed inequality.
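Indeed, for \(\ell \ge 2\), repeated differentiation of (25) gives

$$ f^{(\ell)}(x) = (-1)^{\ell-1}\,(\ell-2)!\;x^{1-\ell}, \qquad x>0, $$

so that \((-1)^{\ell} f^{(\ell)}(x) = -(\ell-2)!\,x^{1-\ell} \le 0\), as claimed.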

This section presents the results of numerical experiments on the performance of the methods described in the previous sections. The operator Ω under consideration is the (normalized) Laplacian operator L of several undirected networks. This matrix is defined from A, the adjacency matrix, whose entry Aij takes the value one when nodes i and j are connected and zero otherwise, and from D, the diagonal matrix of the nodes’ degrees, that is, of the number of links incident to each node:

$$ L = D - A. $$
(26)

Normalization is effected by dividing by trace(L), thereby obtaining a density matrix

$${\Omega} = L/\text{trace}{(L)},$$

with non-negative spectrum and unit trace. The entropy function f defined in (25) is then most appropriately applied to this matrix Ω. This von Neumann entropy has been computed in several previous works; see [6] and references therein. We have applied the techniques described above to this problem.
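As an illustration of this setup (a sketch on a toy adjacency matrix, in Python for readability; the reference values in the tables come instead from the Matlab eig computation described below), the reference entropy can be obtained as follows.

```python
import numpy as np

# Toy undirected graph, given by its adjacency matrix (illustrative only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))                 # diagonal matrix of node degrees
L = D - A                                  # graph Laplacian, equation (26)
Omega = L / np.trace(L)                    # density matrix: nonnegative spectrum, unit trace

lam = np.linalg.eigvalsh(Omega)
lam = lam[lam > 1e-14]                     # enforce the convention 0*log(0) = 0
entropy = -np.sum(lam * np.log(lam))       # von Neumann entropy by full diagonalization
print(entropy)
```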

Codes have been programmed in Matlab R2021a. Computations have been performed on an Intel Xeon Gold 6136 computer (16 cores, 32 threads) equipped with 128 GB RAM, running the Linux operating system.

We test Algorithms 1–4 on the following undirected networks:

Yeast :

(2114 nodes, 4480 edges) describes the protein interaction network for yeast. Each edge represents an interaction between two proteins [34, 35]. The data set was originally included in the Notre Dame Networks Database, and it is now available at [36].

Internet :

(22963 nodes, 96872 edges) is a symmetrized snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP (Border Gateway Protocol) tables posted by the University of Oregon Route Views Project. This snapshot was created by Mark Newman from data for July 22, 2006 [37].

Collaboration :

(40421 nodes, 351304 edges) is the collaboration network of scientists who submitted preprints to the condensed matter archive at www.arxiv.org [38] between January 1, 1995, and March 31, 2005. The original network is weighted; here we consider an unweighted version [37].

Facebook :

(63731 nodes, 1545686 edges) is the largest example we consider. It describes all the user-to-user links (friendships) of the Facebook New Orleans network. It was studied in [39], and the data set is available at [40].

The methods employed are labeled in what follows:

Full diagonalization::

This method computes trace(f(Ω)) as \( - {\sum }_{j=1}^{n} \lambda _{j} \log {(\lambda _{j})}, \) where the eigenvalues λj are obtained by full numerical diagonalization, through the eig function of Matlab; the computation of the λj is well conditioned, Ω being symmetric positive semidefinite. We consider this value to be the reference value of the trace. In particular, the error reported both in the tables and in the figures below is the relative error of the other methods with respect to this value.

Scalar Lanczos::

This method computes trace(f(Ω)) as in (4), where each quadratic form \({\mathcal {Q}}(\mathbf {e}_{i};f)\), \(i = 1, \dots , n\), is approximated as in Algorithm 1.

Global Lanczos::

This method computes the trace as in (6), where each quadratic form \({\mathcal {Q}}(\mathbf {E}_{m};f)\), \(m = 1, \dots , g\), is approximated as explained in Section 3.1, Algorithm 2.

Monte Carlo::

This method employs the procedure described in Section 4, Algorithm 3.

Block Monte Carlo::

This method employs the procedure described in Section 5, Algorithm 4.

The cardinality of the set of random vectors for (Block) Monte Carlo is k, the same as the number of columns of the matrices Em in Global Lanczos, which, we recall, reduces to scalar Lanczos for k = 1.

In the first comparison, the number \(\ell\) of Lanczos steps is not fixed a priori, but is chosen on the fly to ensure that

$$ \frac{{\mathcal{G}}^{(\ell)}(\mu_{\mathbf{v}},f)-{\mathcal{R}}^{(\ell+1)}(\hat \mu_{\mathbf{v}},f)} {{\mathcal{G}}^{(\ell)}(\mu_{\mathbf{v}},f)} \leq 10^{-3}, \qquad \forall \mathbf{v} \in \mathcal{S}, $$
(27)

where the set \(\mathcal {S}\) is defined in Table 1 depending on the algorithm. We test the previous algorithms for three values of the parameter k = 10, 20, 30. Tables 2, 3, 4, and 5 show the relative errors, with respect to the true trace(f(Ω)) obtained by full diagonalization, and the related CPU time. For all networks, the proposed Block Monte Carlo method is robust with respect to the stopping criterion (27), providing a relative error comparable with Global Lanczos but with a much lower CPU time. On the other hand, its CPU time is comparable with that of the Monte Carlo method, in particular for the large networks in Tables 4 and 5, but with a lower relative error. The lower accuracy of the Monte Carlo method is due to the fact that the stopping criterion is applied to each vector us separately, so that it stops earlier than the Block Monte Carlo method. Nevertheless, for the large networks “Collaboration” and “Facebook,” the CPU time of Block Monte Carlo is lower than that of Monte Carlo.

Table 1 Set \(\mathcal {S}\) of the “vectors” that have to satisfy the stopping criterion (27)
Table 2 Performance of the different methods for the Yeast network (n = 2114)
Table 3 Performance of the different methods for the Internet network (n = 22963)
Table 4 Performance of the different methods for the Collaboration network (n = 40421)
Table 5 Performance of the different methods for the Facebook matrix (n = 63731)

To provide a fair comparison between the Block Monte Carlo and Monte Carlo methods, we run the two methods for the same values of \(k = 5, 10, 15, \dots , 200\) and the same numbers \(\ell =2,3,\dots ,10\) of Lanczos steps, comparing the relative errors and the CPU time. Figures 1 and 2 show the relative error and the CPU time, respectively, varying k and \(\ell\) in the ranges above. Each pixel is color-coded according to the color bar on the right of each graph. The colors of different graphs cannot be directly compared, but must be interpreted according to their own color bar. In both figures, the first row contains the results for the Yeast network, while the second row reports the results for the Internet network. From the first to the third column, we find in sequence the global Lanczos method, the Monte Carlo method, and the Block Monte Carlo method.

Fig. 1 Relative error varying \(k = 5, 10, 15, \dots , 200\) and \(\ell =2,3,\dots ,10\). Comparison of the global Lanczos method (first column), the Monte Carlo method (second column), and the block Monte Carlo method (third column) for the Yeast network (first row) and the Internet network (second row)

Fig. 2 CPU time varying \(k = 5, 10, 15, \dots , 200\) and \(\ell =2,3,\dots ,10\). Comparison of the global Lanczos method (first column), the Monte Carlo method (second column), and the block Monte Carlo method (third column) for the Yeast network (first row) and the Internet network (second row)

A first comparison of the CPU times confirms that the two stochastic methods require a CPU time much lower than the global Lanczos method. Moreover, for the two stochastic methods, the CPU time increases both with k and with \(\ell\). On the other hand, for global Lanczos, according to the results in [19], for a fixed \(\ell\) the CPU time can decrease when increasing k. Comparing the two stochastic methods, we observe that for the Yeast network the block Monte Carlo method reduces the CPU time with respect to the Monte Carlo method by a factor of two, while for the larger Internet network the reduction factor is about four. This leads us to expect even greater reductions in computation time for larger problems.

Concerning the relative errors, Fig. 1 shows that all methods produce comparable results. Clearly, the relative error decreases as \(\ell\) increases, while it is not very sensitive to k for k > 10. In particular, for the two stochastic methods, a few random vectors, e.g., k = 20, are already enough to achieve a sufficiently accurate approximation. This behavior is in agreement with the results in [41] for the scalar Algorithm 3. The two stochastic methods provide about the same relative error, although it is usually slightly lower for the block Monte Carlo method; see Fig. 3, where the depicted values, obtained by subtracting the relative errors of the block Monte Carlo method from those of the Monte Carlo method, are usually positive.

Fig. 3 Difference of the relative errors between the Monte Carlo method and the block Monte Carlo method varying \(k = 5, 10, 15, \dots , 200\) and \(\ell =2,3,\dots ,10\), for the Yeast network (left) and the Internet network (right)

7 Conclusions

In this paper, we have reviewed various Lanczos techniques for computing the trace of a matrix function. In particular, we have focused on the stochastic approach that was originally presented in [1] for matrix elements, whose associated measure is the local density of states, and later generalized to the global density of states in [18]. Interpolating between these extremes, we have defined a block version of such algorithms, in a fashion similar to the partial global Lanczos method in [19].

The block Monte Carlo method exhibits interesting properties with respect to accuracy and computational time. Indeed, the numerical experiments performed to evaluate the von Neumann entropy of complex networks show that it is faster than the algorithm in [1], while usually providing a better approximation of the solution. Such promising results lead us to believe that the technique might be useful for the general problem (1) when dealing with large matrices, as in the theory of complex networks (see, e.g., the Estrada index studied in [19]), and also in more challenging quantum mechanical applications.