A fast Monte Carlo algorithm for evaluating matrix functions with application in complex networks

We propose a novel stochastic algorithm that randomly samples entire rows and columns of the matrix to approximate an arbitrary matrix function via its power series expansion. This contrasts with existing Monte Carlo methods, which operate on a single entry at a time, and results in a significantly better convergence rate than the original approach. To assess the applicability of our method, we compute the subgraph centrality and total communicability of several large networks. In all benchmarks analyzed so far, the performance of our method was significantly superior to that of competing approaches, scaling up to 64 CPU cores with remarkable efficiency.


Introduction
Complex networks have become an essential tool in many scientific areas [37,69,70] for studying the interaction between different elements within a system. From these networks, it is possible, for instance, to identify the most important elements in the system, such as the key proteins in a protein interaction network [36], the keystone species in an ecological network [59], vulnerable infrastructures [7], or the least resilient nodes in a transportation network [10].
Measures of node importance are referred to as node centrality, and many metrics have been proposed over the years [37,69]. Here, we consider the family of centrality measures defined in terms of matrix functions [40], which classify the nodes according to how well they can spread information to other nodes in the network. Both the Katz centrality [60] and subgraph centrality [41] belong to this family. In most cases, the node centrality is computed based on the matrix resolvent (I − γA)^{-1} or the matrix exponential exp(A), but other functions [12,16] can be used as well, with A ∈ R^{n×n} denoting the network's adjacency matrix. In this paper, we focus on the subgraph centrality [41] (i.e., diag(f(A))) and total communicability [18] (i.e., f(A)1) problems, where f(A) is a given matrix function.
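As a concrete illustration (our own toy example, not one of the paper's benchmarks), both measures can be evaluated directly with SciPy's dense expm on a hypothetical 4-node path graph:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical toy graph: a 4-node path 1-2-3-4.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

F = expm(A)                             # dense reference evaluation of e^A
subgraph_centrality = np.diag(F)        # SC(i) = (e^A)_{ii}
total_communicability = F @ np.ones(4)  # TC(i) = (e^A 1)_i
```

The two interior nodes take part in more closed walks, so both measures rank them above the endpoints.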
Although these centrality measures have been successfully used in many problems [16,39,40,41], computing a matrix function for large networks can be very demanding with current numerical methods. Direct methods, such as expm [5,53] or the Schur-Parlett algorithm [27], have a computational cost of O(n^3) and yield a full dense matrix f(A), and hence are only feasible for small matrices. Methods based on Gaussian quadrature rules [15,43,47] can estimate the diagonal entries of f(A) without evaluating the full matrix function, but are prone to numerical breakdown when A is sparse and large (which is often the case with real networks), and thus are often employed to determine only the most important nodes in the network. Krylov-based methods [4,34,50,51] can efficiently compute f(A)v, for v ∈ R^n, provided that A is well conditioned. Otherwise, their convergence can be very slow or even stagnate, since a general and well-established procedure for preconditioning the matrix A does not exist. Rational Krylov methods are often more resilient to the condition number and provide a better approximation to f(A)v than polynomial ones, but require solving a linear system for each vector of the basis. Moreover, the stopping criteria for these methods remain an open issue [51].
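For the exponential, the action f(A)v can be evaluated without ever forming f(A), for instance with SciPy's expm_multiply (a sketch on an illustrative sparse ring graph, not one of the paper's networks):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

# Illustrative sparse ring graph: node i is connected to (i ± 1) mod n.
n = 1000
rows = np.arange(n)
A = sp.coo_matrix((np.ones(n), (rows, (rows + 1) % n)), shape=(n, n))
A = (A + A.T).tocsr()                       # symmetric, binary adjacency

# Total communicability e^{gamma A} 1 via the action of the exponential.
gamma = 0.5
tc = expm_multiply(gamma * A, np.ones(n))
```

On the ring every node is equivalent, and since A1 = 2·1, the result is exactly e^{2γ} in every entry.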
Monte Carlo methods [17,29,30,44,58] provide an alternative way to calculate matrix functions, primarily for solving linear systems. In essence, these methods construct a discrete-time Markov chain whose underlying random paths evolve through the different indices of the matrix, which can be formally understood as Monte Carlo sampling of the corresponding Neumann series. Their convergence has been rigorously established in [17,31,58]. Recently, [2,3,48] extended these methods to the evaluation of the matrix exponential and Mittag-Leffler functions.
Another strategy is to construct a random sketch (i.e., a probabilistic representation of the matrix) and then use it to approximate the desired operation. This is a basic idea in contemporary numerical linear algebra [64,66]. Some recent studies have shown that a polynomial Krylov method can be accelerated using randomization techniques [26,52]. In this paper, we propose a new stochastic algorithm that randomly samples full rows and columns of the matrix to approximate the target function using the corresponding power series expansion. Through an extensive set of numerical experiments, we show that our approach converges much faster than the original Monte Carlo method and that it is particularly effective for estimating the subgraph centrality and total communicability of large networks. We also compare our method against other classical and randomized methods on very large matrices.
The paper is organized as follows. Section 2 presents a brief overview of the centrality measures defined in terms of matrix functions. Section 3 describes our randomized algorithm and how it can be implemented efficiently. In Section 4, we evaluate the performance and accuracy of our method by running several benchmarks with both synthetic and real networks. We also compare our method against several other algorithms. In the last section, we conclude the paper and present some future work.

Background
In this section, we introduce some definitions and ideas from graph theory that will be used throughout the paper.

Graph definitions
A graph or network G = (V, E) is composed of a set V = {1, 2, . . ., n} of nodes (or vertices) and a set E = {(u, v) : u, v ∈ V} of edges between them [69]. A graph is undirected if the edges are bidirectional and directed if the edges are unidirectional. A walk of length k over the graph G is a sequence of vertices v_1, v_2, . . ., v_{k+1} such that (v_i, v_{i+1}) ∈ E for i = 1, 2, . . ., k. A closed walk is a walk that starts and ends at the same vertex, i.e., v_1 = v_{k+1}. An edge from a node to itself is called a loop. A subgraph of a graph G is a graph created from a subset of the nodes and edges of G. The degree of a node is defined as the number of edges entering or exiting the node. In directed graphs, the in-degree counts the number of incoming edges and the out-degree the number of outgoing edges.
A graph G can be represented through an adjacency matrix A ∈ R^{n×n} with entries a_ij = w_ij if (i, j) ∈ E and a_ij = 0 otherwise, where w_ij is the weight of the edge (i, j). In this paper, we focus our attention on graphs that are undirected and unweighted (i.e., w_ij = 1 for all edges (i, j)) and that contain neither loops nor multiple edges between nodes. Consequently, all matrices in this paper will be symmetric, binary, and with zeros along the diagonal. Notwithstanding, it is worth mentioning that our method is general and can be applied to other classes of matrices.
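A minimal sketch of this preprocessing, building a symmetric, binary, loop-free adjacency matrix from a hypothetical edge list:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 5-node edge list with one duplicate edge and one loop.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 2), (2, 2)]
n = 5
u, v = zip(*[(a, b) for a, b in edges if a != b])          # drop loops
A = sp.coo_matrix((np.ones(len(u)), (u, v)), shape=(n, n))
A = ((A + A.T) > 0).astype(float)   # symmetrize, collapse duplicates to 1
```

The result is the adjacency matrix of a 5-node ring: symmetric, binary, and with zeros along the diagonal.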

Centrality measures
There are many node centrality measures, and the simplest is arguably the degree centrality [45]. The degree of a node provides a rough estimate of its importance, yet it fails to take into consideration the connectivity of the immediate neighbours with the rest of the network. Instead, let us consider the flow of information in the network. An important node must be part of routes through which information can flow, and thus be able to spread information very quickly to the rest of the network. These information routes are represented as walks over the network. This is the main idea behind walk-based centralities and was formalized by Estrada and Higham [40].
In graph theory, it is well known that the entry (A^k)_ij counts the number of walks of length k ≥ 1 over the graph G that start at node i and end at node j. Then, the entry f_ij of the matrix function f(A), defined through the power series

f(A) = Σ_{k=0}^∞ ζ_k A^k, (2)

measures how easily information can travel from node i to node j. The coefficients ζ_k must decay sufficiently fast when k is large, in order to penalize the contribution of longer walks and ensure the convergence of the series. The two most common choices for f(A) are the matrix exponential e^A and the resolvent (I − γA)^{-1} [40], but other matrix functions can be used as well [11,12,16].
From the power series (2), Estrada defined the f-subgraph centrality [40,41] as the diagonal of f(A), that is, f_SC(i) = (f(A))_ii, which measures the importance of node i based on its participation in all subgraphs of the network. The sum of the subgraph centrality over all nodes has become known as the Estrada Index [28,35], which was first introduced to quantify the degree of folding in protein chains [35] and later extended to characterize the global structure of general complex networks [38,42].
Later, Benzi and Klymko [16,18] introduced the concept of f-total communicability based on the row sums of f(A), ranking the nodes according to how well they can communicate with the rest of the network. Formally, the f-total communicability is expressed as

f_TC(i) = (f(A)1)_i, (3)

where 1 is a vector of length n with all entries set to 1. If we consider the matrix resolvent f(A) = (I − γA)^{-1}, the total communicability of a node corresponds to the well-known Katz centrality [55,60].
In the context of network science, it is common to introduce a weighting parameter γ ∈ (0, 1) and work with the parametric matrix function f (γA).
Algorithm 1 A Monte Carlo method adapted from [17] for computing a matrix F as an approximation of f(A). N_s represents the number of random walks for approximating each row and W_c is the weight cutoff.

1: function MC(A, N_s, W_c)
2:     F = 0
3:     for i = 1, 2, . . ., n do ▷ for each row in A
4:         for s = 1, 2, . . ., N_s do ▷ for each random walk
5:             ℓ_0 = i, k = 0, W(0) = 1
6:             F(i, ℓ_0) = F(i, ℓ_0) + ζ_0 W(0)/N_s
7:             while |W(k)| > W_c |W(0)| do
8:                 ℓ_{k+1} = SelectNextState(T, ℓ_k)
9:                 W(k+1) = W(k) a_{ℓ_k ℓ_{k+1}} / t_{ℓ_k ℓ_{k+1}}
10:                k = k + 1
11:                F(i, ℓ_k) = F(i, ℓ_k) + ζ_k W(k)/N_s
12:            end while
13:        end for
14:    end for
15:    return F
16: end function

The parameter γ can be interpreted as the inverse temperature and accounts for external disturbances on the network [39]. Furthermore, the value of γ is often chosen in such a way that the terms ζ_k (γA)^k in (2) are monotonically decreasing, in order to preserve the notion that information travels faster to nearby nodes than to those that are farther away.

Randomized algorithm for matrix functions
Ulam and von Neumann [44] were the first to propose a Monte Carlo method for computing the matrix inverse as a way to solve linear systems, which was later refined in [17,29,31]. It consists of generating random walks over the matrix A to approximate each power in the corresponding series (2). Starting from a row ℓ_0, a random walk consists of a random variable W(k) and a sequence of states ℓ_0, ℓ_1, . . ., ℓ_k, which are obtained by randomly jumping from one row to the next. At each step k, the program updates W(k) and adds the result to the entry f(A)_{iℓ_k} as an approximation for the k-th term in the series (2).
The full procedure is described in Algorithm 1. The SelectNextState routine randomly selects an entry in row ℓ_k to determine which row to jump to in the next step of the random walk. The probability of choosing an entry j is equal to P(ℓ_{k+1} = j | ℓ_k = i) = t_ij, where t_ij is an entry of a transition probability matrix T.
The main limitation of this method is that each random walk only updates a single entry of f(A) at a time, requiring a large number of walks just to estimate a single row with reasonable accuracy. Therefore, our objective is to modify this algorithm such that it samples entire rows and columns of A when approximating each term in the series (2), drastically reducing the number of walks necessary to achieve the desired precision.
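A minimal Python sketch of this single-entry scheme for f(A) = e^A (our illustration, not the paper's implementation; the transition probabilities below are one simple choice, and the ratio ζ_k/ζ_{k−1} = 1/k of the exponential is folded into the weight):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

def mc_expm_row(A, i, n_walks=10000, w_cut=1e-6):
    """Estimate row i of e^A; each walk updates one entry per step."""
    n = A.shape[0]
    # Transition probabilities proportional to |a_kj| (assumes no zero rows).
    T = np.abs(A) / np.abs(A).sum(axis=1, keepdims=True)
    row = np.zeros(n)
    for _ in range(n_walks):
        state, w, k = i, 1.0, 0
        row[state] += w                     # k = 0 term, zeta_0 = 1
        while abs(w) > w_cut:               # weight-cutoff termination
            k += 1
            nxt = rng.choice(n, p=T[state])
            w *= A[state, nxt] / (T[state, nxt] * k)  # zeta_k/zeta_{k-1} = 1/k
            row[nxt] += w
            state = nxt
    return row / n_walks

# 5-node ring as a toy test case, against a dense reference.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
row0, ref = mc_expm_row(A, 0), expm(A)[0]
```

Even on this tiny graph, thousands of walks are needed for two-digit accuracy, which motivates the row/column sampling introduced next.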

Mathematical description of the method
In the following, we discuss how to extend the randomized matrix product algorithm proposed in [32,33] to compute an arbitrary matrix power.
Lemma 1 Let R_i and C_j denote the i-th row and j-th column of A ∈ R^{n×n}. The matrix power A^p with p ∈ N and p ≥ 2 can be evaluated as

A^p = Σ_{ℓ_1=1}^n · · · Σ_{ℓ_{p−1}=1}^n a_{ℓ_1ℓ_2} · · · a_{ℓ_{p−2}ℓ_{p−1}} C_{ℓ_1} R_{ℓ_{p−1}}. (4)

Proof Recall that the matrix product AB with A, B ∈ R^{n×n} can be expressed as a sum of outer products [33]:

AB = Σ_{k=1}^n C_k B_k,

where C_k is the k-th column of A and B_k is the k-th row of B. Therefore, a power p of a square matrix A can be written as

A^p = Σ_{k=1}^n C_k (A^{p−1})_k, (5)

where (A^{p−1})_k denotes the k-th row of A^{p−1}. For a single row, (5) reduces to

(A^{p−1})_k = Σ_{j=1}^n a_{kj} (A^{p−2})_j. (6)

Recursively applying (6) for the powers p, p−1, . . ., 2 and then substituting the expansion in (5) leads to the expression in (4).

□
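Lemma 1 can be checked numerically; the sketch below verifies the case p = 3 on a random 6 × 6 matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))

# Lemma 1 for p = 3: A^3 = sum over l1, l2 of a_{l1 l2} * C_{l1} R_{l2},
# where C_{l1} is the l1-th column of A and R_{l2} the l2-th row.
P = np.zeros((n, n))
for l1 in range(n):
    for l2 in range(n):
        P += A[l1, l2] * np.outer(A[:, l1], A[l2, :])
```

The accumulated sum of rank-one matrices coincides with the direct power A^3.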
Corollary 1 Let f(A) : R^{n×n} → R^{n×n} be a matrix function defined by the power series (2). Then f(A) can be decomposed as

f(A) = ζ_0 I + ζ_1 A + U, (7)

where

U = Σ_{k=2}^∞ ζ_k A^k.

Concerning matrix U, it can be written as the following sum of rank-one matrices:

U = Σ_{k=0}^∞ ζ_{k+2} Σ_{ℓ_0=1}^n · · · Σ_{ℓ_k=1}^n a_{ℓ_0ℓ_1} · · · a_{ℓ_{k−1}ℓ_k} C_{ℓ_0} R_{ℓ_k}. (8)

The multiple sums appearing in the definition of matrix U in (8) can be exploited in practice to construct a probabilistic algorithm similar to [17,29], which was originally created for computing the inverse matrix. In fact, the formal procedure is analogous, but instead of generating random scalar variables, our goal here is to randomly generate rank-one matrices governed by the following Markov chain.
Definition 1 Let {X_k : k ≥ 0} be a Markov chain taking values in the state space S = {1, 2, . . ., n}, with initial distribution

p_i = P(X_0 = i) = ∥C_i∥_2 / Σ_{j=1}^n ∥C_j∥_2, (9)

and transition probability matrix T = (t_ij) given by

t_ij = P(X_{k+1} = j | X_k = i) = |a_ij| / Σ_{l=1}^n |a_il|. (10)

Here, the indices ℓ_m denote the corresponding state reached by the Markov chain after m steps. We use the initial state ℓ_0 of the Markov chain to randomly choose a column of the matrix A, C_{ℓ_0}. After k steps of the Markov chain, the state is ℓ_k, which is used to select the corresponding row of the matrix A, R_{ℓ_k}. During this random evolution of the Markov chain, different states are visited according to the corresponding transition probabilities, and along this process a suitable multiplicative random variable W(k) is conveniently updated. Finally, matrix U can be computed through the expected value of a given functional, as proved formally by the following lemma.
Lemma 2 Let Z(ω) be a realization of a random matrix at a point ω of the discrete sample space, defined as

Z(ω) = Σ_{k=0}^∞ ζ_{k+2} Y^(k)(ω), with Y^(k)(ω) = W(k) C_{ℓ_0} R_{ℓ_k}. (11)

Here W(k) is a multiplicative random variable defined recursively as

W(0) = 1/p_{ℓ_0}, W(k) = W(k−1) a_{ℓ_{k−1}ℓ_k} / t_{ℓ_{k−1}ℓ_k}. (12)

Then, it holds that

E[Z] = U. (13)

Proof Note that Z(ω) is obtained from equation (11) as a sum of independent random matrices Y^(k)(ω). Let P^(k)(ω) be the probability of an event ω consisting of a transition from ℓ_0 to ℓ_k after k steps. This probability turns out to be p_{ℓ_0} t_{ℓ_0ℓ_1} t_{ℓ_1ℓ_2} . . . t_{ℓ_{k−1}ℓ_k}. Therefore, the expected value of the random matrix Y^(k)(ω) is given by

E[Y^(k)] = Σ_ω P^(k)(ω) Y^(k)(ω).

Note that every event ω can be described by different values of k + 1 integer indices, each running from 1 to n; then

E[Y^(k)] = Σ_{ℓ_0=1}^n · · · Σ_{ℓ_k=1}^n a_{ℓ_0ℓ_1} · · · a_{ℓ_{k−1}ℓ_k} C_{ℓ_0} R_{ℓ_k}.

Therefore, from (8) we conclude that E[Y^(k)] = A^{k+2}. Finally, after summing all the contributions coming from any number of steps, and by linearity of the expected value operator, we obtain (13). □

Practical implementation of the probabilistic method
To transform Lemma 2 into a practical algorithm, we must first select a finite sample size N_s and then compute the expected value E[Z] in (13) as the corresponding arithmetic mean. Additionally, each random walk must terminate after a finite number m of steps. Mathematically, this is equivalent to considering only the first m terms of the power series expansion. Since some random walks may retain important information about the matrix for more steps than others, and it is very difficult to determine a priori the number of steps required to achieve a specific precision, we adopt the following termination criterion: the computation of a random walk ends when the associated weight falls below a relative threshold W_c [17]. In other words, a random walk terminates at step m when

|W(m)| < W_c |W(0)|, (15)

where W(m) is the weight after m steps and W(0) is the weight at the initial step of the random walk. Formally, the infinite series in (11) is then truncated as

f(A) ≈ ζ_0 I + ζ_1 A + Û, (17)

with

Û = (1/N_s) Σ_{s=1}^{N_s} Σ_{k=0}^{m_s} ζ_{k+2} Ω^(k)_{ij} C_i R_j, (18)

where Ω^(k)_ij corresponds to the weight W(k) of the random walk that began at state i and arrived at state j after k steps. Here s indicates a realization of a random walk and m_s its length.
Computing Û directly from (18) is ill-advised due to the large number of outer products, which are also very difficult to parallelise efficiently. Instead, let us rearrange (17) into a more suitable form. The random walks with the same starting column can be grouped, with N_i denoting the number of random walks that began at column i. Assuming that N_s ≫ 1, the value of N_i can be estimated a priori as N_i ≈ p_i N_s, with p_i = P(X_0 = i) as defined in (9). Let ν_ij denote a visit to the state j at step k of a random walk that started at state i. For each visit ν_ij, the weight of the walk is added to the entry q_ij of a matrix Q, defined in (20). Then, we can rewrite (18) as

f(A) ≈ ζ_0 I + ζ_1 A + AQA. (21)

Algorithm 2 describes the procedure for approximating f(A) based on equations (17) and (21). Assuming that the matrix A is sparse with N_nz nonzero entries, Algorithm 2 requires O(N_s m) operations to construct the matrix Q and O(n N_nz) to calculate the final product AQA, for a total computational cost of the order of O(N_s m + n N_nz). It also uses an additional n^2 space in memory to store the matrix Q. It is possible to reduce memory consumption if the program divides the rows of Q into blocks in such a way that only one block is computed at a time. At the end of each block, the program updates the matrix f(A) and reuses the memory allocation for the next block.
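A compact Python sketch of the resulting procedure for f(A) = e^A (illustrative only, not the paper's C++ code; the initial and transition probabilities are one simple choice, and the ζ factors are folded into the walk weight):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

def rand_funm_exp(A, n_s=10000, w_cut=1e-6):
    """Row/column sampling sketch: e^A ~ I + A + A Q A, with Q built
    from random walks (zeta_k = 1/k! for the exponential)."""
    n = A.shape[0]
    col_norm = np.linalg.norm(A, axis=0)
    p = col_norm / col_norm.sum()                 # initial distribution
    T = np.abs(A) / np.abs(A).sum(axis=1, keepdims=True)
    Q = np.zeros((n, n))
    for i in range(n):
        n_i = max(1, int(round(p[i] * n_s)))      # walks starting at column i
        for _ in range(n_i):
            state, w, k = i, 0.5, 0               # w carries zeta_{k+2}; zeta_2 = 1/2
            Q[i, state] += w / n_i
            while abs(w) > w_cut * 0.5:           # relative weight cutoff
                k += 1
                nxt = rng.choice(n, p=T[state])
                w *= A[state, nxt] / (T[state, nxt] * (k + 2))
                Q[i, nxt] += w / n_i
                state = nxt
    return np.eye(n) + A + A @ Q @ A              # zeta_0 = zeta_1 = 1

# 5-node ring as a toy test case.
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
F = rand_funm_exp(A)
```

Every walk now contributes to an entire row of Q, and the rank-one structure is recovered only once, in the final product AQA.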

Diagonal of the matrix function
Algorithm 2 can be conveniently modified to compute only the diagonal of the matrix function. Let Q_k denote the k-th row of the matrix Q defined in (20). Then, the diagonal of f(A) is approximated by the vector d = (d_i) as follows:

d_i = ζ_0 + ζ_1 ⟨R_i, e_i⟩ + Σ_{k=1}^n a_ik ⟨Q_k, C_i⟩, (22)

where ⟨·, ·⟩ denotes the inner product and e_i a vector from the canonical basis.

Algorithm 2 A probabilistic algorithm for computing the matrix F as an approximation of f(A). N_s represents the total number of random walks and W_c the weight cutoff.

1: function RandFunm(A, N_s, W_c)
2:     Q = 0
3:     for i = 1, 2, . . ., n do ▷ for each column in A
4:         N_i = N_s P(ℓ_0 = i) ▷ see equation (9)
5:         for s = 1, 2, . . ., N_i do ▷ for each random walk
6:             ℓ_0 = i, k = 0, W(0) = 1
7:             q(i, ℓ_0) = q(i, ℓ_0) + ζ_2 W(0)/N_i
8:             while |W(k)| > W_c |W(0)| do
9:                 ℓ_{k+1} = SelectNextState(T, ℓ_k)
10:                W(k+1) = W(k) a_{ℓ_k ℓ_{k+1}} / t_{ℓ_k ℓ_{k+1}}
11:                k = k + 1
12:                q(i, ℓ_k) = q(i, ℓ_k) + ζ_{k+2} W(k)/N_i
13:            end while
14:        end for
15:    end for
16:    return F = ζ_0 I + ζ_1 A + AQA
17: end function

Essentially, the program updates the value of d_i immediately after the computation of the row Q_k. In this way, only a single row of Q needs to reside in memory at a time. Naturally, if a_ik ∼ 0, the program can skip the calculation of Q_k in order to save computational resources. Note that multiple entries can be computed at the same time by reusing Q_k and then selecting the appropriate entry a_ik and column C_i.
In terms of computational cost, the computation of all diagonal entries requires O(N_s m + N_nz n_c) floating-point operations, where n_c denotes the average number of nonzero entries per column, and consumes an additional n space in memory. The algorithm described in this section will be referred to as RandFunmDiag.
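The row-streaming idea can be sketched as follows, with a random stand-in for the walk-accumulated matrix Q and ζ_0 = ζ_1 = 1 as for the exponential:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
A = rng.standard_normal((n, n))
Q = rng.standard_normal((n, n))     # stand-in for the walk-accumulated matrix

# Diagonal of F = I + A + A Q A without forming A Q A: process one row
# Q_k at a time, so only a single row of Q must reside in memory.
d = np.ones(n) + np.diag(A)
for k in range(n):
    v = Q[k, :] @ A                 # <Q_k, C_i> for every column C_i of A
    d += A[:, k] * v                # d_i += a_ik * <Q_k, C_i>
```

Rows with a_ik ≈ 0 for all i could be skipped entirely, which is where the sparsity savings come from.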

Action of the matrix function over a vector
Let v be a real vector in R^n; our goal is to develop another algorithm based on Lemma 2 for computing f(A)v. First, let us multiply the truncated series (17) by the vector v:

f(A)v ≈ ζ_0 v + ζ_1 Av + Û v, (23)

with r_i = ⟨R_i, v⟩. Rearranging the series such that the random walks with the same starting column are grouped, the action of the matrix function f(A) over the vector v can be approximated as

y = ζ_0 v + ζ_1 Av + Aq, (25)

with q = (q_i).

Algorithm 3 A probabilistic algorithm for computing y as an approximation of f(A)v. N_s represents the total number of random walks and W_c the weight cutoff threshold.

1: function RandFunmAction(A, v, N_s, W_c)
2:     q = 0
3:     for i = 1, 2, . . ., n do ▷ for each column in A
4:         N_i = N_s P(ℓ_0 = i) ▷ see equation (9)
5:         for s = 1, 2, . . ., N_i do ▷ for each random walk
6:             ℓ_0 = i, k = 0, W(0) = 1
7:             q_i = q_i + ζ_2 W(0) r_{ℓ_0}/N_i
8:             while |W(k)| > W_c |W(0)| do ▷ compute the k-th step
9:                 ℓ_{k+1} = SelectNextState(T, ℓ_k)
10:                W(k+1) = W(k) a_{ℓ_k ℓ_{k+1}} / t_{ℓ_k ℓ_{k+1}}
11:                k = k + 1
12:                q_i = q_i + ζ_{k+2} W(k) r_{ℓ_k}/N_i
13:            end while
14:        end for
15:    end for
16:    return y = ζ_0 v + ζ_1 Av + Aq
17: end function

Algorithm 3 describes the procedure for approximating f(A)v based on equation (25) and the definition of the vector q. It has a time complexity of O(N_s m + N_nz) and requires an additional n space in memory to store the vector q.
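A Python sketch of this variant for f(A) = e^A (illustrative; the probabilities are one simple choice, the ζ factors are folded into the walk weight, and the values r_j = ⟨R_j, v⟩ are precomputed as Av):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)

def rand_funm_action_exp(A, v, n_s=10000, w_cut=1e-6):
    """Randomized action sketch: e^A v ~ v + A v + A q."""
    n = A.shape[0]
    col_norm = np.linalg.norm(A, axis=0)
    p = col_norm / col_norm.sum()
    T = np.abs(A) / np.abs(A).sum(axis=1, keepdims=True)
    u = A @ v                              # r_j = <R_j, v> for every j
    q = np.zeros(n)
    for i in range(n):
        n_i = max(1, int(round(p[i] * n_s)))
        for _ in range(n_i):
            state, w, k = i, 0.5, 0        # w carries zeta_{k+2}; zeta_2 = 1/2
            q[i] += w * u[state] / n_i
            while abs(w) > w_cut * 0.5:
                k += 1
                nxt = rng.choice(n, p=T[state])
                w *= A[state, nxt] / (T[state, nxt] * (k + 2))
                q[i] += w * u[nxt] / n_i
                state = nxt
    return v + A @ v + A @ q               # zeta_0 = zeta_1 = 1

A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
y = rand_funm_action_exp(A, np.ones(5))
```

Only the vector q is accumulated, which is why the memory overhead drops from n^2 to n compared with the full-matrix variant.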

Convergence of the method and numerical errors
In the following, we prove the convergence of the Monte Carlo method described by equations (17) and (21).

Theorem 1 Let m be any positive integer. For each k ∈ {0, . . ., m}, let (ξ^(k)(s))_{s≥1} be a collection of i.i.d. vector-valued random variables in R^{n²}, obtained by vectorizing the rank-one random matrices generated by the random walks. Then the corresponding centred and scaled Monte Carlo estimator converges in distribution to a random vector distributed according to a multivariate normal distribution N(0, Θ). Here Θ = (θ_ij) is the covariance matrix, with θ_ij = Cov(v_i(s), v_j(s)), and v_i(s) the i-th component of the random vector V(s).
Proof This proof is based on the proof of Theorem 3.4 in [58], conveniently modified for the current numerical method. Assuming that all random walks are independently generated, the variance of the estimator can be analysed componentwise; here, ξ_i(s) denotes the i-th component of the random vector ξ^(k)(s). To compute the expected value E[(ξ_i(s))²], we have to enumerate all the different transitions that can occur between a given initial state ℓ_0 and an arbitrary final state ℓ_k of the Markov chain in k steps, along with the corresponding probabilities. Using equation (12) together with the definition of the transition probabilities in (10), the k-th term of the resulting series can be bounded by a constant multiple of |ζ_k| α^k, for a suitable constant α depending on the entries of A and on the transition probabilities. Since lim_{k→∞} α |ζ_{k+1}|/|ζ_k| < 1, the series Σ_{j=0}^∞ |ζ_j| α^j converges, and therefore the variance Var(v_i(s)) is bounded.
Note, however, that for the specific case of the exponential function, |ζ_{k+1}|/|ζ_k| = 1/(k + 1), and therefore any value of α ensures convergence of the series. For the inverse function, in contrast, α < 1 is strictly required.
Since the variance is finite, the Central Limit Theorem for vector-valued random variables (see, e.g., [56]) guarantees the result. □

Therefore, for a finite sample size N_s, replacing the expected value in (13) by the arithmetic mean introduces an error that is statistical in nature and distributed according to a normal distribution. From Theorem 1, the standard error of this mean, ε, can be readily estimated as σ/√N_s, with σ the corresponding standard deviation.
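The N_s^{−1/2} behaviour of the standard error is easy to reproduce with a generic Monte Carlo mean (a toy illustration, unrelated to the network data):

```python
import numpy as np

rng = np.random.default_rng(5)

# The statistical error of a Monte Carlo mean scales as sigma / sqrt(Ns):
# quadrupling the sample size should roughly halve the standard error.
def empirical_std_error(n_samples, n_runs=400):
    means = [rng.standard_normal(n_samples).mean() for _ in range(n_runs)]
    return np.std(means)

e_small = empirical_std_error(1000)    # ~ 1/sqrt(1000)
e_large = empirical_std_error(4000)    # ~ 1/sqrt(4000)
```

The measured ratio e_small/e_large is close to 2, matching the square-root law.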

Numerical examples
To illustrate the applicability of our method, we compute the subgraph centrality and total communicability of several complex networks using the matrix exponential e^{γA} with γ ∈ [0, 1]. Due to the random nature of the Monte Carlo algorithms, all results reported in this section correspond to the mean value of 10 runs of the algorithm using different random seeds.
All numerical simulations were executed on a commodity server with an AMD EPYC 9554P (64 cores, 3.75 GHz) and 256 GB of RAM, running Fedora 38. All randomized algorithms were implemented in C++ using OpenMP. The code was compiled with AMD AOCC v4.0 with the -O3 and -march=znver4 flags. Our implementation uses the PCG64DXSM [71] random number generator.
The algorithms were tested using two synthetic graphs (smallworld and kronecker) as well as a set of networks extracted from real-world data, which are described in Table 1. Note that, before calculating the subgraph centrality and total communicability of directed graphs, their adjacency matrix A must be symmetrized as the bipartite matrix

[ 0 A; A^T 0 ],

in order to split apart the outgoing and incoming edges of the graph [16].
We also remove all duplicated edges, loops and disconnected nodes from the kronecker graph generated by the Graph500 code [1].
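A sketch of this symmetrization on a toy directed 3-cycle, assuming the bipartite construction [[0, A], [A^T, 0]]:

```python
import numpy as np

# Bipartite symmetrization of a directed adjacency matrix: each node is
# split into a "hub" (row) copy and an "authority" (column) copy, so the
# resulting 2n x 2n matrix is symmetric with a zero diagonal.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # directed 3-cycle (toy example)
n = A.shape[0]
S = np.block([[np.zeros((n, n)), A],
              [A.T, np.zeros((n, n))]])
```

The symmetric matrix S can then be fed to any of the undirected-graph algorithms above.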

Numerical errors and accuracy
Figs. 1a and 1b show the relative ℓ∞ error of our method as a function of N_s. Recall from Section 3.2 that the algorithm assigns a priori N_j ∼ ∥C_j∥_2 random walks to each node j of the graph. Therefore, if N_s is too low, very few random walks will be assigned to a node i with a low norm, such that its centrality score is essentially approximated by just the first two terms of the series, i.e., SC(i) ≈ 1 + a_ii and TC(i) ≈ 1 + (A1)_i. For this reason, the relative errors are highly dependent on the structure of the graph. In particular, the nodes in the smallworld network have very similar probabilities, and therefore it can happen that for N_s < n some nodes will have no chance of being randomly chosen. Only when the sample size is sufficiently large can the algorithm properly estimate the centrality of every node. In this scenario, the numerical error of the algorithm scales with O(N_s^{−0.5}), similar to other probabilistic methods. This relation is confirmed numerically by the trend lines in Figs. 1a and 1b, which have a slope of approximately −0.5 in logarithmic scale.

Table 2 shows the relative ℓ∞ error for a fixed number of samples (N_s = 10^8 and W_c = 10^{−6}), as well as the standard measurement error obtained after several independent runs.

In Figs. 2a and 2b we show the relative ℓ∞ error as a function of W_c. The value of W_c controls the length of the random walks in terms of the number of steps, which is related to the number of terms of the power series expansion. When the value of W_c is large, the algorithm stops the random walk generation too early, leading to large errors. On the other hand, when the value of W_c is small, the algorithm continues generating the random walk for more steps than necessary. Note that this increases the computational cost without necessarily improving the accuracy of the method. In fact, we also have to consider the statistical error, which depends on the number of generated random walks. According to Figs. 2a and 2b, the optimal value for W_c is around 10^{−6} for these networks.
Figs. 3a and 3b show how the relative ℓ∞ error grows with the size of the smallworld and kronecker networks. The smallworld network starts as a ring lattice with n nodes, each connected to 10 neighbours. The algorithm then rewires each edge in the lattice with a 10% chance, i.e., the edge (i, j) is replaced by (i, k), where k is chosen at random from all the possible nodes in the network such that there are no loops or duplicated edges. As a result, the random walks have very similar weights W(k) independently of the sequence of nodes visited. In other words, the norm of the covariance matrix σ² = ∥Θ∥∞ is very low. Considering that N_s is fixed, there are fewer random walks to estimate the centrality score of each node as the graph size increases, which reduces the precision of the algorithm by a factor of √n, as shown in Figs. 3a and 3b. This is in line with the theoretical results from Section 3.5, where ε_r ∼ σ N_s^{−0.5}. In contrast, the nodes in the kronecker graph are organized hierarchically. At the top level, there is a single node acting as the central hub for the entire graph. As we move down the hierarchy, there are more nodes per level, but they have fewer connections. The number of levels in the hierarchy, as well as the number of nodes and connections at each level, are determined by the size of the graph. Therefore, the weight of the random walks can vary greatly depending on which nodes are visited and their position in the hierarchy. Larger graphs have a higher covariance norm σ² due to a wider degree difference between nodes.

Comparison with other methods
There are a few algorithms available in the literature for computing the matrix exponential. Perhaps the most well-known scheme is the expm routine from MATLAB [5,53,54]. The method first scales the matrix A by a power of 2 to reduce its norm to order 1, calculates a Padé approximant of the matrix exponential, and then repeatedly squares the result to recover the original exponent. For a generic n × n matrix, expm requires O(n^3) arithmetic operations and an additional O(n^2) space in memory, and it calculates the entire matrix e^{γA}.
To rank the nodes using the subgraph centrality, we only need to calculate the diagonal entries of e^{γA}, not the complete matrix. Methods for estimating the individual entries of the matrix function have been proposed by Golub, Meurant and others [15,43,47] and are based on Gaussian quadrature rules and the Lanczos algorithm. They require O(n) operations to determine each diagonal entry, resulting in a total cost of O(n^2) to calculate the subgraph centrality for all nodes. In practice, this method may suffer from numerical breakdowns when A is large and sparse [13,43]. For this reason, it is often restricted to estimating only the k most important nodes in the graph.
Likewise, the total communicability only requires the action of f(A) over a vector, setting v = 1, which can be computed efficiently using either a polynomial or a rational Krylov method [4,34,50,51]. These methods consist of generating a Krylov basis using the input matrix and then evaluating the function over the projected matrix through some direct method, such as expm. Assuming a sparse matrix with N_nz nonzeros and a Krylov basis with m vectors, the computational cost is O(m N_nz). In particular, we compared our method against the restarted polynomial Krylov method [4,34] from the funm_kryl toolbox [49].
While writing this article, Güttel and Schweitzer published a preprint [52] proposing two new randomized algorithms, sFOM and sGMRES, for estimating f(A)v. Here, we focus on sFOM, since sGMRES works best with Stieltjes functions and requires a numerical quadrature rule for all other functions. sFOM first creates a random sketch of the matrix A and then uses an incomplete Arnoldi decomposition to generate a non-orthogonal Krylov basis from this sketch. However, the basis may be ill-conditioned, and thus, to stabilize the method, a thin QR decomposition (also called whitening [67]) of the basis must be computed before evaluating the matrix function over the projected matrix. The computational cost of sFOM is O(N_nz m log m + m^3). Another recent preprint, by Cortinovis, Kressner and Nakatsukasa [26], proposes a different randomization strategy for the Krylov method: an Arnoldi-like decomposition that iteratively builds the non-orthogonal Krylov basis V_m using only random sketches of the basis vectors. Again, the method may apply whitening [67] to improve the condition number of the basis. Afterwards, the program solves a least-squares problem to obtain the projected matrix. We will denote this algorithm as rand_kryl; it has an overall computational cost of O(N_nz m^2 + m^3).
MC denotes the Monte Carlo method adapted from [17], as described in Algorithm 1. Similar to our randomized algorithm, the length of the random walks depends on the weight cutoff W_c. It has a computational cost of O(m N_s), where m denotes the average number of steps in a random walk, and does not require additional space in memory. This method can be modified to calculate f(A)v or f(A)_ii instead of the full matrix function. Note that for the latter, the method still computes the full f(A), but discards all the off-diagonal entries.
Fig. 4 compares the serial execution time and accuracy of the different methods when computing the subgraph centrality. The graphs are sorted according to the number of nodes. The similarity between two node rankings is measured using the Pearson correlation coefficient [14]. Here, cc_1 denotes the correlation coefficient between the top 1% of nodes in a reference ranking and in the ranking obtained by the algorithm. Note that, if two or more nodes have similar centrality scores, numerical deviations can alter their ranking order, lowering the correlation coefficient. Nevertheless, such nodes have similar importance within the network, and thus their order in the ranking may not be relevant for understanding the dynamics of the graph.
Although expm reaches machine precision, it cannot be used for large graphs due to its hefty computational cost and cubic scaling. In fact, the cond-mat graph with 40k nodes is already too large for expm and cannot be processed in a reasonable amount of time. MC requires a very large number of random walks to estimate the subgraph centrality, as it only updates a single entry of the matrix exponential at a time, and this entry is more likely to lie outside the diagonal when the matrix is very large. For this reason, the accuracy of MC is quite poor even with a large number of random walks. Both RandFunm and RandFunmDiag have the same accuracy and correlation, since the core algorithm is the same. However, RandFunmDiag does not require the computation of the full matrix product at the end of the algorithm, resulting in a speedup between 3× and 24× over RandFunm. Moreover, the full matrix exponential of the twitch and stanford graphs is too large to be represented in memory, and thus their subgraph centrality can only be calculated by RandFunmDiag. The randomized algorithms also show a very high correlation for the top 1% of nodes in the ranking compared with the reference list.

Fig. 5 compares RandFunmAction against all Krylov-based methods in terms of serial execution time and accuracy. The tolerance of funm_kryl was set to 10^{−8} and the size of the Krylov basis of sFOM and rand_kryl was set to 4, such that all algorithms have comparable precision. We choose a small value of γ to avoid overflow, as we are working with large positive matrices, and consequently all algorithms converge very quickly to the target precision.
The stopping criterion of funm_kryl is well known to be pessimistic [51], resulting in much higher precision than the target at the cost of longer execution times. In some networks (twitch, orkut and twitter), sFOM and rand_kryl outperformed funm_kryl, mainly due to the smaller basis, while in others (uk-2005) the additional cost associated with the sketching, whitening, least-squares QR and other operations leads to significant slowdowns. Considering the accuracy difference, the randomization in the Krylov method does not seem to be very effective when the basis is relatively small. RandFunmAction shows the best performance among all algorithms, in particular for the twitter network, being 3.8× faster than funm_kryl, while sFOM and rand_kryl are only 2.7× and 1.7× faster, respectively.
For complex networks, it suffices for the algorithm to be accurate enough to differentiate the centrality scores of all nodes in the graph; there is no benefit in achieving higher accuracy. Indeed, the ranking produced by RandFunmAction has a correlation greater than 0.95 for the top 1% of nodes despite its lower accuracy. The only exception is the twitter network. As a massive social network, it has no clear structure or hierarchy, so many nodes have similar centrality scores and small numerical variations can drastically change the rank order. In comparison, uk-2005 is an equally large web graph with a clearer structure and well-defined hubs and authorities, and thus it is less susceptible to noise. For this reason, the correlation for the twitter network is much lower than for the other graphs, and greater precision is required to differentiate the nodes. Note that the top 1% of the ranking contains almost 1 million nodes in both the twitter and uk-2005 networks, with a wide range of centrality scores. If we consider only the top 0.1%, the correlation of RandFunmAction for the twitter network increases to 0.79.
Figs. 6a, 6b, 7a and 7b show the elapsed serial time as a function of the number of nodes for the kronecker and smallworld networks. The computational cost of the expm algorithm is of order O(n^3), regardless of the sparsity or the distribution of the nonzeros of the matrix, since it was originally proposed for dense matrices. The MC algorithm was too slow for the prescribed accuracy, and thus it was not included in the graph. When N_nz ∼ n, the computational cost of the RandFunm and RandFunmDiag algorithms becomes of order O(n^2), because computing the matrix product is more expensive than generating the random walks. Note that this cost is similar to other algorithms [43] for estimating the diagonal entries of the matrix. However, the main advantages here are the smaller proportionality constant and the capability to compute the subgraph centrality of large, sparse matrices without risking a numerical breakdown [43]. Again, the RandFunmDiag algorithm is significantly faster than the RandFunm algorithm, as it only requires a partial evaluation of the matrix product at the end of the algorithm. All Krylov-based methods scale linearly with n as they rely on matrix-vector multiplications. Similar to the other Monte Carlo methods, RandFunmAction spends more time computing the matrix-vector product than generating the random walks, and thus also scales linearly with n.

Single entry
One of the main advantages of Monte Carlo algorithms is the ability to calculate a single entry of the solution without computing the full solution. Table 3 shows the relative ℓ∞ error for calculating the total communicability of the node with the highest degree. Both RandFunmAction and MC were modified to calculate a single entry of the solution as efficiently as possible. Due to its ability to sample entire rows and columns, RandFunmAction produces a much better approximation of (f(A)v)_i than MC for the same number of random walks, independently of the network.
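For reference, the quantity estimated here, entry i of f(A)v with f = exp and v = 1, can be evaluated deterministically with a truncated Taylor series using only matrix-vector products. The sketch below (assuming NumPy; the function name and truncation length are our own choices) illustrates the target value that the Monte Carlo estimators approximate:

```python
import numpy as np

def single_entry_communicability(A, i, gamma, terms=30):
    """Entry i of exp(gamma*A) @ 1 via the truncated Taylor series
    sum_k (gamma*A)^k 1 / k!, computed with matrix-vector products only."""
    v = np.ones(A.shape[0])
    total = v.copy()
    term = v.copy()
    for k in range(1, terms):
        term = gamma * (A @ term) / k   # accumulates (gamma*A)^k 1 / k!
        total += term
    return total[i]
```

Unlike the Monte Carlo estimators, this deterministic evaluation still touches the whole matrix at every step, which is precisely what single-entry sampling avoids.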

Parallel Performance
Parallelizing our randomized algorithm is fairly straightforward. Multiple rows of Q can be computed at the same time in the RandFunmDiag algorithm, yet the diagonal entries must be updated atomically to avoid data races. Likewise, the RandFunmAction algorithm can compute the vector q as well as the final product Aq completely in parallel. Fig. 8 shows the strong scaling of the parallel implementations of the RandFunmDiag and RandFunmAction algorithms. The scalability of both algorithms is excellent, attaining more than 85% efficiency for most networks when using 64 cores. In particular, the parallel code achieved near-perfect scaling for the smallworld network due to its low degree per node and almost uniform structure, which leads to a more efficient use of the cache as well as an even distribution of load across the processors.
In most networks, the random walks are not distributed equally across the nodes, such that some rows of Q and entries of q take longer to compute than others. To address this load imbalance, the program dynamically distributes the vector q and the matrix Q over the CPU cores. This solution was very effective for most networks, significantly improving the performance of the program. Still, the CPU may be underutilized at the end of the computation if the graph is very unbalanced, which is the case for directed graphs due to the symmetrization of the adjacency matrix, as shown in (33).
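The dynamic distribution described above can be modeled as follows (a simplified Python sketch, not the actual multithreaded native implementation): each worker pulls the next available row from a shared queue, and each task returns its index together with its value, so no entry is written concurrently:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def estimate_diagonal_parallel(n, compute_entry, n_workers=4):
    """Fill a diagonal estimate in parallel; the executor's internal queue
    hands out rows one by one, balancing uneven per-row workloads."""
    diag = np.zeros(n)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # compute_entry(i) must return (i, value); results are written
        # only by the main thread, so no atomic updates are needed here
        for i, value in pool.map(compute_entry, range(n)):
            diag[i] = value
    return diag
```

Here `compute_entry` stands in for the per-row random-walk accumulation of RandFunmDiag; in the real code the rows have widely varying cost, which is what makes dynamic scheduling pay off.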
Another limiting factor is the latency and bandwidth of the main memory. Most operations with large, sparse matrices are well known to be memory-bound, as they cannot use the cache hierarchy effectively while requiring additional logic and memory accesses to handle the sparse storage format. In fact, the funm_kryl algorithm shows no benefit when running in a multithreaded environment, since it relies on sparse matrix-vector products and the majority of the code is written in MATLAB. In contrast, our randomized algorithm only needs to compute two sparse matrix products: one at the beginning and another at the end of the algorithm. Memory bandwidth still affects the scalability of the method when working with massive networks, such as twitter and uk-2005. Even under these conditions, the program obtained significant speedups when using 64 cores, achieving 60% efficiency for twitter and 70% for uk-2005.

Katz Centrality
One of the most well-known centrality measures based on matrix functions is Katz centrality (KC) [55,60]. It is defined through the linear system (I − γA)x = 1, with KC(i) = x_i as the centrality score of node i. Here, γ is called the attenuation factor and must lie between 0 and ρ(A)^−1, where ρ(A) is the spectral radius of A [60]. Different information can be extracted from the network by changing the value of γ [19]. For instance, as γ tends to ρ(A)^−1, Katz centrality approaches the eigenvector centrality [23,24], whereas as γ tends to 0, it converges to the degree centrality.
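As a concrete baseline, the defining system can be solved directly for a small graph. The following NumPy sketch (the function name is ours; a dense solve is only viable for small n) makes the definition explicit:

```python
import numpy as np

def katz_centrality(A, gamma):
    """Katz centrality: solve (I - gamma*A) x = 1.
    Requires 0 < gamma < 1/rho(A) for the system to be well behaved."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * A, np.ones(n))
```

On a triangle graph, where every node has degree 2, symmetry gives x_i = 1/(1 − 2γ) for every node, which is a convenient sanity check.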
There are several ways to solve the linear system (I − γA)x = 1. Direct solvers, such as MUMPS [8,9] or Intel Pardiso [74], first compute an LU or similar factorization and then solve the linear system using backward/forward substitution. However, the factorization is a very costly procedure, scaling as O(n^3), and also requires additional space to store the matrices L and U. On the other hand, sparse iterative solvers, such as Conjugate Gradient, GMRES and BiCGSTAB, can converge very quickly to the solution, especially when a good preconditioner is available. Last, but not least, Monte Carlo methods solve the linear system through the truncated power series x = Σ_k (γA)^k v with v = 1. However, this series only converges when ρ(γA) < 1. Moreover, if ρ(γA) is close to 1, the convergence rate will be very slow, and many terms of the expansion must be computed to reach a reasonable accuracy.
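The truncated-series route can be sketched in the same spirit (assumptions: NumPy and our own truncation parameter; an actual Monte Carlo solver samples this series with random walks rather than evaluating it exactly):

```python
import numpy as np

def katz_neumann(A, gamma, terms=60):
    """Approximate x = sum_{k>=0} (gamma*A)^k 1, valid only when
    rho(gamma*A) < 1; convergence slows as rho(gamma*A) approaches 1."""
    x = np.ones(A.shape[0])
    term = np.ones(A.shape[0])
    for _ in range(terms):
        term = gamma * (A @ term)   # next power (gamma*A)^k applied to 1
        x += term
    return x
```

Each iteration costs one sparse matrix-vector product, which is why the number of required terms, governed by ρ(γA), dominates the overall cost.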
Fig. 9 compares RandFunmAction against the original Monte Carlo method (MC) and a simple Conjugate Gradient algorithm (CG). Here, the error ε of CG is equal to the residual norm ∥1 − (I − γA)x∥_2, while for RandFunmAction and MC it corresponds to the ℓ∞ error using the CG results as the reference. Note that we avoid the costly computation of the dominant eigenvalue by leveraging Gershgorin's theorem [46], i.e., ρ(γA) ≤ max_i Σ_k |γa_ik|. Again, RandFunmAction is faster and more accurate than MC for the same number of random walks N_s, yet it is still not as good as the sparse iterative solver. Even without preconditioning, CG converges extremely quickly to the solution due to the small value of γ. It is possible to enhance the performance of our method by combining it with a Richardson iteration, as shown in [17]. This is left for future work.
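The Gershgorin-based choice of attenuation factor can be reproduced as follows (NumPy sketch; `safe_gamma` is our own name, and the 0.85 safety factor mirrors the setting used in the Fig. 9 experiments):

```python
import numpy as np

def safe_gamma(A, factor=0.85):
    """Attenuation factor from Gershgorin's bound rho(A) <= max_i sum_k |a_ik|,
    guaranteeing rho(gamma*A) <= factor < 1 without an eigenvalue computation."""
    max_row_sum = np.abs(A).sum(axis=1).max()
    return factor / max_row_sum
```

The bound can be loose for irregular graphs, in which case γ ends up smaller than strictly necessary, but that conservatism is exactly what makes both CG and the series-based methods converge quickly here.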
We want to emphasize that Monte Carlo methods are better suited to evaluating matrix functions other than the matrix inverse, for which classical methods are either too expensive or not applicable at all.

Conclusion
This paper proposes a novel stochastic algorithm that randomly samples rows and columns of the matrix to approximate the matrix powers that appear in the power series expansion. It can evaluate any matrix function by using the corresponding coefficients of the series. The algorithm can be conveniently modified to compute either f(A)v or the diagonal of f(A) without computing the entire matrix function. As a way to test the applicability of our method, we computed the subgraph centrality and total communicability of several large networks using the matrix exponential. Within this context, the stochastic algorithm proved to be particularly effective, outperforming the competition. Our method is also highly scalable in a multithreaded environment, showing remarkable efficiency with up to 64 cores.
In this paper, we focused primarily on the analysis of complex networks, as they bear a very close relation to the method itself, but the algorithm can be applied to any scientific problem that can be expressed in terms of matrix functions, providing a quick way to estimate the solution with reasonable accuracy.

Competing interests
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Fig. 1: Relative ℓ∞ error as a function of the number of random walks N_s for W_c = 10^−6. The degree γ of the polynomial fitting curve is indicated as ε_r ∼ N_s^γ.

Fig. 2: Relative ℓ∞ error as a function of the weight cutoff W_c for N_s = 10^8.

Fig. 3: Relative ℓ∞ error as a function of the number of nodes n of the graph, considering W_c = 10^−6 and N_s = 10^8. The degree γ of the polynomial fitting curve is indicated as ε_r ∼ n^γ.

Fig. 4: Comparison between different algorithms when calculating the subgraph centrality for real networks and γ = 10^−3. Here, we consider the centrality scores generated by our algorithm with N_s = 10^11 as the reference.

Fig. 5: Comparison between different algorithms when computing the total communicability for real networks and γ = 10^−5. Here, we consider the centrality scores generated by expmv [6] as the reference.

Fig. 6: Elapsed time for computing the subgraph centrality as a function of the number of nodes n for a fixed accuracy ε_r and γ = 10^−3.

Fig. 7: Elapsed time for computing the total communicability as a function of the number of nodes n for a fixed accuracy ε_r and γ = 10^−5.

Fig. 8: Strong scaling for W_c = 10^−8 and a fixed number of random walks.

Fig. 9: Comparison among the different algorithms when computing the Katz centrality for γ = 0.85 / max_i Σ_k |a_ik| and N_s = 10^9. The execution time is measured using all 64 cores.

Table 3: Relative error for calculating the total communicability of a single node i = arg max_j Σ_k |a_jk|, considering N_s = 10^8, W_c = 10^−6 and γ = 10^−5.