Path-integral formulation of spreading processes in complex networks

We introduce a path-integral formulation of network-based measures that generalize the concept of geodesic distance and that provides fundamental insights into the dynamics of disease transmission as well as an efficient numerical estimation of the infection arrival time.


Introduction
The forecast and control of emergent diseases has become particularly important in recent years because of the increasingly growing structure and velocity of transportation means. A series of papers have been devoted to the problem of infection arrival times in complex networks [2,3,6,7]. Most of them are based on the assumption that a single dominant path, associated with maximal traffic probability, is sufficient to estimate accurately the arrival time of a diffusive process. While this is partially true in some specific cases, when a single path between each pair of nodes is available, in the general scenario this assumption can give poor estimates for the arrival time.
Effective distances (ED) in the dominant-path approach can be defined, for both directed and undirected networks, as the geodesic graph distance of a weighted graph with edge weights given by the first moment of a distribution known from extreme events statistics [8], which depends only on the network topology and on the transmission and recovery rates. This approach has the disadvantage that it can significantly overestimate the infection arrival time obtained numerically [5,6]. In addition, in situations where multiple equiprobable paths exist between node pairs, as in regular lattices, this approach breaks down. A more realistic scenario takes into account all possible propagation routes [7] yielding a multiple-path ED, which is supposedly the best possible estimate of infection arrival times. Unfortunately, the computation becomes infeasible as the number of paths between two nodes grows exponentially with the size of the network. The lack of a practical computational approach leads back to considering only the dominant path.
Here, we introduce a path-integral formulation that generalizes the multiple-path ED and that can be used a e-mail: flavio.iannelli@business.uzh (corresponding author) as a computationally feasible alternative with a clear interpretation, paving the way for future studies on spreading processes in complex networks.
The paper is organized as follows. In Sect. 2.1, we define the dominant-path ED by considering nodeindependent exit rates. In Sect. 2.2, we present the generalization to multiple paths of arbitrary length connecting two nodes. In Sect. 2.3, we follow the randomwalk approach defining the path-integral formulation of the problem. Then using a metapopulation model, we compare the path-integral formulation of ED with the infection arrival times. In Sect. 3, we summarize our results and present our conclusions.

Dominant path
The fundamental metric in networks is the geodesic distance, i.e. the shortest-path length over all paths {Γ ij } connecting node i to node j. For weighted networks, where the weights W ij are positive numbers identifying the carrying capacity of a certain route, each edge (k, l) ∈ Γ ij contributes to the total length with its reciprocal weight. This generalizes the standard definition for unweighted networks as The reciprocal of the weights is used consistently with the fact that a higher flux of passengers along an edge reduces the distance between the respective nodes. Starting from the heuristic definition equation (1), it is possible to extend the notion of distance by replacing the weights with an effective function of the weights f (W kl ) that quantitatively reproduces the distance as measured by the arrival time of spreading processes unfolding on a given network. In [3], the authors proposed the ansatz where P kl = W kl / m W km is the transition probability to navigate the graph via random walks. The choice for the logarithm is motivated by the authors simply by requiring the additivity of the distance equation (2), consistently with multiplying the corresponding probabilities. Although this ED is able to reproduce to a good approximation the infection arrival time in the GMN, its interpretation is quite obscure and the particular choice of the logarithm function is not derived from a microscopic description of the process. Furthermore, the choice for the constant term equal to unity in its definition is arbitrary and it is not clear a priori why such expression would correctly quantify the arrival times of reaction-diffusion processes. Most importantly, the definition equation (2) suffers from the bias of considering a single (probability-dominant) path, neglecting all other routes for information transfer. An alternative and refined approach to derive analytically the ED using a detailed kinetic description of spreading processes was outlined in [7]. Interestingly, Eq. (2) can be obtained as a special case of a more general quantity. To show this, one needs to extend and generalize the Markovian description presented in [7] and define the ED using a dominant-path approach, by requiring the maximization of the travel probability between adjacent subpopulations, which can be expressed in terms of the matrix P ij . The main difference from the existing result is that we assume that the exit rate of each node q i = l =i Q il is independent on the node location, i.e. q i = α for all nodes. After some rearrangements, one obtains an expression with the same form of Eq. (2), but with an additional dependence on the spreading parameters [9].
This yields the dominant-path ED Here we have defined where β and μ are the infection and recovery rates while γ e ≈ 0.577 is the Euler-Mascheroni constant. From Eq. (3), we can recover the ansatz for the ED equation (2) simply by setting λ = 1. However, in the dominant-path ED equation (3), the definition of λ as a function of the epidemic and mobility parameters gives the optimized edge weight that should contribute into the minimization condition over all paths connecting source and target. On the computational side, one can obtain the full matrix D DP ij using the Dijkstra algorithm in a time O(NE + N 2 log N ), where E and N are the graph size and order, respectively. The most important limitation of Eq. (3) is that only the path that minimizes the topological length and at same time maximizes the associated probability is considered. It turns out that, because of this limitation, the infection arrival time [3], is overestimated with respect to the numerical arrival times obtained from direct simulations [5,6]

Multiple paths
The correct approach is to consider the multiplicity of transmission routes. The framework to include all directed paths of transmission was developed in [7]. For simplicity, we start by considering only two distinct paths Γ and Γ connecting the same pair of nodes. The two-path ED D 2P ij that generalizes the dominant-path ED equation (3) satisfies the equation where is the ED, which is the mean of a Gumbel distribution, associated with a path Γ ij of arbitrary length connecting node i to node j. Equation (5) can be easily generalized to an arbitrary number of paths connecting the same pair of nodes, as The previous equation defines the multiple-path ED where the total probability associated to the path Γ ij of length n( An analogous expression can be obtained by grouping all probabilities associated to paths of same length into the quantity F ij (n) = |Γ |=n P (Γ ij ). Then we can replace the sum over all paths connecting i to j in Eq. (8) with a sum over the allowed path lengths where n max is the maximum path length in the network.
If instead of considering all paths in Eq. (8) we select the single path Γ * ij of length n(Γ * ij ) = |Γ * ij | that is associated to the dominant contribution, i.e. the path that maximizes its associated probability and minimizes the topological path length, we recover the dominant-path ED equation (3), i.e.
The multiple-path ED provides an accurate estimate of the infection arrival time, as it counts the most probable route as well as all possible alternative-directed transmission routes. However, a major drawback of this approach is that it is computationally not tractable.
In fact, the total number of paths between i and j is bounded but exponentially growing with the number of nodes N , thus the measure D MP ij becomes useless for large networks [13]. A trade-off between performance and accuracy can be achieved by restricting the path search algorithm to a maximum path length or more elegantly, as we show next, by relaxing the assumption of direct propagation and define a path-integral formulation of the problem.

Random walks
Both measures introduced in the previous sections D DP ij and D MP ij rely on the fact that the epidemic will spread along paths, i.e. routes that do not ever cross. Here we follow a different approach and introduce an ED that includes all possible random walks from source to target. Relaxing the assumption of directed spread is equivalent to effectively erasing the memory from the system at each time step. This is achieved by including in Eq. (8) all walks {Ξ} that, contrary to the paths {Γ }, allow also walking on already visited nodes. We define the random-walk effective distance (RWED) by generalizing Eq. (8) as where P (Ξ ij ) = (k,l)∈Ξij P kl is the total probability associated to the walk Ξ ij of length n(Ξ ij ) = |Ξ ij |. We note that since {Ξ} is a bigger set than {Γ }, the following inequalities hold: As we did in the previous section for paths with probability P (Γ ), we can group the probabilities associated to walks of the same length into the quantity H ij (n) = |Ξ|=n H ij (Ξ ij ). The latter is precisely the hitting time probability defined recursively for i = j as H ij (n) = k =j P ik H kj (n). Contrary to the multiplepath scenario, walks are unbounded and so becomes the maximum length n max in Eq. (10). Substituting, we rewrite the RWED as Remarkably, there is a immediate interpretation of the RWED in terms of the hitting-times generating functions. Indeed by definition where n ij is the random-walk hitting time to node j [10] and the average is taken over all random walks that start in i and terminate as soon as they hit node j. It is easy to see that where n k c are the hitting time cumulants and Ψ ij (λ) is the cumulant generating function of the random-walk hitting time. Hence, one obtains the cumulants of the random-walk hitting time by differentiating Eq. (14) with respect to λ. In particular, the mean-first-passage time (MFPT) from i to j is obtained as M ij = n ij c = ∂ λ D RW ij | λ=0 . To compute D RW ij (λ), we can write H ij (n) in terms of powers of the transition probability sub-matrix P (j) obtained by removing the jth row and column, as for the computation of the MFPT. Assuming a positive λ, the expansion in a geometric series converges 1 and we obtain Here, I (j) is the (N − 1) × (N − 1) sub-identity matrix and p (j) is the jth column of P with jth component removed that takes into account the last step needed to reach the target j. Next we show how we can disentangle the dominant contribution defined by D DP ij in the RWED, using a path-integral formulation [14] of ED. Indeed, we can rewrite the RWED equation (12) as where A(Ξ, λ) is interpreted as the Euclidean action defined as As previously, n(Ξ) is the length of the walk Ξ and P (Ξ) is the associated probability. In this picture, the RWED is the the free energy functional (changed in sign) of an Euclidean field theory [11,14]. The dominant-path ED is instead described by Eq. (11). In contrast to Eq. (18), the dominant-path ED neglects contributions of paths other than the one that maximizes the associated probability. Then we can think of D DP ij as an effective action 2 where Γ * ij is the (probability) dominant path of length n(Γ * ij ). Expanding the action around the dominant contribution as and plugging this into Eq. (18) we find In the previous equation δΞ quantifies the deviation of the walk Ξ with respect to the dominant path Γ * . By comparing with the ansatz D eff ij defined by Eq. (2), we can finally identify where the effective action is defined by Eq. (20). The additional contributions in the RWED with respect to the effective action, given by the logarithmic term on the right-hand side of Eq. (22), account for the multiplicity of transmission routes. The latter quantifies the discrepancy between the random-walk and dominant-path approach. We expect this contributions to be negligible for network topologies that are locally tree-like, where a single path connects any pair of nodes. For many real-world networks, this is the case since the degree distribution behaves as a power-law and the navigation is dominated by the presence of large hubs [4]. However, we expect the dominantpath approach to completely break down when the multiplicity of paths becomes relevant as in the case of random (Poissonian) networks and regular lattices where a large number of paths are equally probable. In this case, the difference |D RW ij − D DP ij | will be non-negligible and the arrival time estimates will substantially decrease when considering the random-walk approach.
As the substrate for epidemic spreading, we consider the global mobility network (GMN), where nodes are different airports and weighted edges represents the number of seats on scheduled commercial flights between them. In Fig. 1, we show the GMN by drawing an arc for each link connecting two airports, color coded according to the flux of passengers on that connection. Assuming that the number of seats on scheduled commercial flights is on average proportional to the number of passengers traveling, the data can be represented as a weighted network with adjacency matrix W ij giving the total traffic per day between airport i and airport j.
To each airport j (a node in the metapopulation) is associated a subpopulation of size N j so that W ij = Q ij N i , where Q ij is the transition rate from i to j. Although the network is directed, the degree of asymmetry in the weighted adjacency matrix W ij is very low. This particular feature is well confirmed also in similar empirical datasets of air-traffic at the global scale [1,3]. We estimate quantitatively the degree of asymmetry by looking at both the topological and weights asymmetry. The former is quantified by the average number of non-zero elements in the corresponding unweighted adjacency matrix A ij = χ(W ij ), where χ is the step function equal to one for positive arguments and vanishing otherwise. Instead the weight asymmetry w is defined as the normalized net difference between travel fluxes in each route and the corresponding reversed travel. We find = 2 · 10 −3 and w = 3.1 · 10 −9 . Thus, we redefine the a symmetric adjacency matrix as W ij = (W ij + W ji )/2. Symmetrizing the weights also assures us that the subpopulation sizes are conserved quantities sinceṄ i = k (W ki − W ik ). In principle, the matrix W ij is provided by traffic data and N i by census data such that the rates Q ij , for i = j, can be computed inverting W ij = Q ij N i . However, although it is straightforward to measure W ij , assessing the effective population is more subtle. The number of individuals that effectively participate in the dispersal N i is not necessarily the same as the population data provided by census. We also assume that the exit rate q i = l =i Q il is independent of the node i and reformulate the diffusion contribution in terms of the transition matrix P ij and a constant diffusion rate α. The latter is given by the Fig. 1 The global mobility network (GMN) of air-traffic. Each edge corresponds to a scheduled commercial flight, with gradient scaling from dark to light-blue according to the available number of seats ratio α = W/N , between the total flux (per unit time) W = ij W ij and the total population N = i N i . In terms of α, we can remove the dependence on the subpopulation size N i from the reaction-diffusion equations.
For the numerical simulations, we use the SIR metapopulation model with transmission and recovery probability per time step given by the rates β and μ, respectively. We denote by S j , I j and R j the number of individuals in subpopulation j who are susceptible, infected, and removed, respectively. The normalized quantities are ρ X j = X j /N j , where X ∈ {S, I, R}, and N j is the subpopulation size. The full SIR metapopulation model is obtained by combining a diffusion term Ω({ρ X }) = α k P ik ρ X k − ρ X i , with a global exit rate α, with the compartmental SIR reactions in a unified picture that readṡ The key quantity we are interested in estimating is the infection arrival time, defined for each pair of node by the matrix where the subscript i implies that the process started in node i at time t = 0, and by definition T ii = 0. We use T ij as the benchmark of infection arrival times in real-world pandemic scenarios and compare it to the ED defined by Eq.

Discussion
In this paper, we have introduced a path-integral formulation of network effective distances by relaxing the assumption of simple-path propagation of spreading processes. These results provide additional insights into the relation between random-walk-based metrics and epidemic spreading in complex networks. The proposed RWED includes the previously defined dominant-path measure as a particular case. The almost perfect correlation found with the infection arrival time can be explained as follows. The contribution of looped trajectories for diffusion is dramatically reduced because of the decreasing exponential in the walk length in Eq. (12). The latter damps all contributions of very long walks, and in particular allows us to neglect the contributions of infinite loops. In scenarios where multiple parallel paths are important, for instance in Erdos-Renyi graphs or regular lattices, the assumption of a single dominant path breaks down and the measure proposed here can be used as an efficient alternative. The predictive power of the RWED can be used for containment strategies and estimation of arrival times for real global pandemics from the underlying networks topology. The method can in fact be generally applied to any weighted and directed network besides transportation networks, e.g. social networks.
It remains intriguing to test if this is the case also for rumor spreading dynamics. For unweighted locally tree-like networks both the dominant-path ED and RWED yield maximum correlation with the simulated arrival time, as the dominant path tends to dominate. Our results show that infection arrival time in networked systems and heuristic definitions of ED can be obtained within a path-integral formulation of the Right: corresponding plot in the hidden space of RWED, where the epidemic spreads as a highly correlated circular wave centered at the infection seed problem and with the definition of suitable effective actions.
Funding Open Access funding enabled and organized by Projekt DEAL.

Data Availability Statement
This manuscript has no associated data or the data will not be deposited. [Authors' comment: This is a theoretical study and no experimental data.] Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.