
Scalable Estimation of Epidemic Thresholds via Node Sampling

Abstract

Infectious or contagious diseases can be transmitted from one person to another through social contact networks. In today’s interconnected global society, such contagion processes can cause global public health hazards, as exemplified by the ongoing Covid-19 pandemic. It is therefore of great practical relevance to investigate the network transmission of contagious diseases from the perspective of statistical inference. An important and widely studied boundary condition for contagion processes over networks is the so-called epidemic threshold. The epidemic threshold plays a key role in determining whether a pathogen introduced into a social contact network will cause an epidemic or die out. In this paper, we investigate epidemic thresholds from the perspective of statistical network inference. We identify two major challenges that are caused by high computational and sampling complexity of the epidemic threshold. We develop two statistically accurate and computationally efficient approximation techniques to address these issues under the Chung-Lu modeling framework. The second approximation, which is based on random walk sampling, further enjoys the advantage of requiring data on a vanishingly small fraction of nodes. We establish theoretical guarantees for both methods and demonstrate their empirical superiority.

Introduction

Infectious diseases are caused by pathogens, such as bacteria, viruses, fungi, and parasites. Many infectious diseases are also contagious, which means the infection can be transmitted from one person to another when there is some interaction (e.g., physical proximity) between them. Today, we live in an interconnected world where such contagious diseases could spread through social contact networks to become global public health hazards. A recent example of this phenomenon is the Covid-19 outbreak caused by the so-called novel coronavirus (SARS-CoV-2) that has spread to many countries (Huang et al. 2020; Zhu et al. 2020; Wang et al. 2020; Sun et al. 2020). This recent global outbreak has caused serious social and economic repercussions, such as massive restrictions on movement and share market decline (Chinazzi et al. 2020). It is therefore of great practical relevance to investigate the transmission of contagious diseases through social contact networks from the perspective of statistical inference.

Consider an infection being transmitted through a population of n individuals. According to the susceptible-infected-recovered (SIR) model of disease spread, the pathogen can be transmitted from an infected person (I) to a susceptible person (S) with an infection rate given by β, and an infected individual becomes recovered (R) with a recovery rate given by μ. This can be modeled as a Markov chain whose state at time t is given by a vector \(({X^{t}_{1}}, \ldots , {X^{t}_{n}})\), where \({X^{t}_{i}}\) denotes the state of the ith individual at time t, i.e., \({X^{t}_{i}} \in \{S, I, R\}\). For the population of n individuals, the state space of this Markov chain becomes extremely large with \(3^{n}\) possible configurations, which makes it impractical to study the exact system. This problem was addressed in a series of three seminal papers by Kermack and McKendrick (1927, 1932, 1933). Instead of modeling the disease state of each individual at a given point of time, they proposed compartmental models, where the goal is to model the number of individuals in a particular disease state (e.g., susceptible, infected, recovered) at a given point of time. Since their classical papers, there has been a tremendous amount of work on compartmental modeling of contagious diseases over the last ninety years (Hethcote, 2000; Van den Driessche and Watmough, 2002; Brauer and Castillo-Chavez, 2012).

Compartmental models make the unrealistic assumption of homogeneity, i.e., each individual is assumed to have the same probability of interacting with any other individual. In reality, individuals interact with each other in a highly heterogeneous manner, depending upon various factors such as age, cultural norms, lifestyle, weather, etc. The contagion process can be significantly impacted by heterogeneity of interactions (Meyers et al. 2005; Rocha et al. 2011; Galvani and May, 2005; Woolhouse et al. 1997), and therefore compartmental modeling of contagious diseases can lead to substantial errors.

In recent years, contact networks have emerged as a preferred alternative to compartmental models (Keeling, 2005). Here, a node represents an individual, and an edge between two nodes represents social contact between them. An edge connecting an infected node and a susceptible node represents a potential path for pathogen transmission. This framework can realistically represent the heterogeneous nature of social contacts, and therefore provide much more accurate modeling of the contagion process than compartmental models. Notable examples where the use of contact networks has led to improvements in prediction or understanding of infectious diseases include Bengtsson et al. (2015) and Kramer et al. (2016).

Consider the scenario where a pathogen is introduced into a social contact network and it spreads according to an SIR model. It is of particular interest to know whether the pathogen will die out or lead to an epidemic. This is dictated by a set of boundary conditions known as the epidemic threshold, which depends on the SIR parameters β and μ as well as the network structure itself. Above the epidemic threshold, the pathogen invades and infects a finite fraction of the population. Below the epidemic threshold, the prevalence (total number of infected individuals) remains infinitesimally small in the limit of large networks (Pastor-Satorras et al. 2015). There is growing evidence that such thresholds exist in real-world host-pathogen systems, and intervention strategies are formulated and executed based on estimates of the epidemic threshold (Dallas et al. 2018; Shulgin et al. 1998; Wallinga et al. 2005; Pourbohloul et al. 2005; Meyers et al. 2005). Fittingly, the last two decades have seen a significant emphasis on studying epidemic thresholds of contact networks from several disciplines, such as computer science, physics, and epidemiology (Newman 2002; Wang et al. 2003; Colizza and Vespignani 2007; Chakrabarti et al. 2008; Gómez et al. 2010; Wang et al. 2016, 2017). See Leitch et al. (2019) for a complete survey on the topic of epidemic thresholds.

Concurrently but separately, network data has rapidly emerged as a significant area in statistics. Over the last two decades, a substantial amount of methodological advancement has been accomplished in several topics in this area, such as community detection (Bickel and Chen, 2009; Zhao et al. 2012; Rohe et al. 2011; Sengupta and Chen, 2015), model fitting and model selection (Hoff et al. 2002; Handcock et al. 2007; Krivitsky et al. 2009; Wang and Bickel, 2017; Yan et al. 2014; Bickel and Sarkar, 2016; Sengupta and Chen, 2018), hypothesis testing (Ghoshdastidar and von Luxburg 2018; Tang et al. 2017a, 2017b; Bhadra et al. 2019), and anomaly detection (Zhao et al. 2018; Sengupta, 2018; Komolafe et al. 2019), to name a few. The state-of-the-art toolbox of statistical network inference includes a range of random graph models and a suite of estimation and inference techniques.

However, there has not been any work at the intersection of these two areas, in the sense that the problem of estimating epidemic thresholds has not been investigated from the perspective of statistical network inference. Furthermore, the task of computing the epidemic threshold based on existing results can be computationally infeasible for massive networks. In this paper, we address these gaps by developing a novel sampling-based method to estimate the epidemic threshold under the widely used Chung-Lu model (Aiello et al. 2000), also known as the configuration model. We prove that our proposed method has theoretical guarantees for both statistical accuracy and computational efficiency. We also provide empirical results demonstrating our method on both synthetic and real-world networks.

The rest of the paper is organized as follows. In Section 2, we formally set up the problem statement and formulate our proposed methods for approximating the epidemic threshold. In Section 3, we describe the theoretical properties of our estimators. In Section 4, we report numerical results from synthetic as well as real-world networks. We conclude the paper with discussion and next steps in Section 5.

Epidemic Thresholds

Table 1 lists the common symbols used throughout the paper. Consider a set of n individuals labelled as 1,…, n, and an undirected network (with no self-loops) representing interactions between them. This can be represented by an n-by-n symmetric adjacency matrix A, where A(i, j) = 1 if individuals i and j interact and A(i, j) = 0 otherwise. Consider a pathogen spreading through this contact network according to an SIR model. From existing work (Chakrabarti et al. 2008; Gómez et al. 2010; Prakash et al. 2010; Wang et al. 2016, 2017), we know that the boundary condition for the pathogen to become an epidemic is given by

$$ \frac{\beta}{\mu} = \frac{1}{\lambda(A)}, $$
(1)

where λ(A) is the spectral radius of the adjacency matrix A.

Table 1 Common symbols

The left hand side of Eq. 1 is the ratio of the infection rate to the recovery rate, which is purely a function of the pathogen and independent of the network. As this ratio grows larger, an epidemic becomes more likely, as new infections outpace recoveries. The right hand side of Eq. 1 is the inverse of the spectral radius of the adjacency matrix, which is purely a function of the network and independent of the pathogen. The larger the spectral radius, the more connected the network, and therefore the more likely an epidemic becomes. Thus, the boundary condition in Eq. 1 connects the two aspects of the contagion process: the pathogen transmissibility, which is quantified by β/μ, and the social contact network, which is quantified by the spectral radius. If \(\frac {\beta }{\mu } < \frac {1}{\lambda (A)}\), the pathogen dies out, and if \(\frac {\beta }{\mu } > \frac {1}{\lambda (A)}\), the pathogen becomes an epidemic.

Given a social contact network, the inverse of the spectral radius of its adjacency matrix represents the epidemic threshold for the network. Any pathogen whose transmissibility ratio is greater than this threshold is going to cause an epidemic, whereas any pathogen whose transmissibility ratio is less than this threshold is going to die out. Therefore, a key problem in network epidemiology is to compute the spectral radius of the social contact network.
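To make the boundary condition in Eq. 1 concrete, the following is a minimal sketch that computes the threshold 1/λ(A) for a toy network and compares it with a given ratio β/μ. The function names and the star-shaped toy network are illustrative assumptions, not part of the paper; the sketch uses a dense eigenvalue routine, which is exactly the step that becomes expensive for large n.

```python
import numpy as np

def epidemic_threshold(A):
    """Return 1 / lambda(A), where lambda(A) is the spectral radius of A."""
    # A is symmetric, so eigvalsh applies; the spectral radius is the
    # largest eigenvalue in absolute value.
    eigenvalues = np.linalg.eigvalsh(A)
    return 1.0 / np.max(np.abs(eigenvalues))

def causes_epidemic(beta, mu, A):
    """True if beta / mu exceeds the epidemic threshold of the network."""
    return beta / mu > epidemic_threshold(A)

# Toy contact network (hypothetical): a star with hub 0 and four leaves.
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1

# The spectral radius of a star with four leaves is sqrt(4) = 2,
# so the threshold is 0.5.
threshold = epidemic_threshold(A)
```

With this toy network, a pathogen with β/μ = 0.6 lies above the threshold and one with β/μ = 0.4 lies below it.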

Problem Statement and Heuristics

Realistic urban social networks that are used in modeling contagion processes have millions of nodes (Eubank et al. 2004; Barrett et al. 2008). To compute the epidemic threshold of such networks, we need to find the largest (in absolute value) eigenvalue of the adjacency matrix A. This is challenging for two reasons.

  1. First, from a computational perspective, eigenvalue algorithms have computational complexity of \(\Omega(n^{2})\) or higher. For massive social contact networks with millions of nodes, this can become too burdensome.

  2. Second, from a statistical perspective, eigenvalue algorithms require the entire adjacency matrix for the full network of n individuals. It can be challenging or expensive to collect interaction data of n individuals of a massive population (e.g., an urban metropolis). Furthermore, eigenvalue algorithms typically require the full matrix to be stored in the random-access memory of the computer, which can be infeasible for massive social contact networks which are too large to be stored.

The first issue could be resolved if we could compute the epidemic threshold in a computationally efficient manner. The second issue could be resolved if we could compute the epidemic threshold only using data on a small subset of the population. In this paper, we aim to resolve both issues by developing two approximation methods for computing the spectral radius.

To address these problems, let us look at the spectral radius, λ(A), from the perspective of random graph models. The statistical model is given by \(A \sim P\), which is short-hand for \(A(i,j) \sim \text {Bernoulli}(P(i,j))\) for 1 ≤ i < j ≤ n. Then λ(A) converges to λ(P) in probability under some mild conditions (Chung and Radcliffe, 2011; Benaych-Georges et al. 2019; Bordenave et al. 2020). To make a formal statement regarding this convergence, we reproduce below a slightly paraphrased version (for notational consistency) of an existing result in this context.

Lemma 1 (Theorem 1 of Chung and Radcliffe (2011)).

Let

$$ {\Delta} = \underset{1 \le i \le n}{\max} \sum\limits_{j=1}^{n} P(i,j) $$

be the maximum expected degree, and suppose that for some 𝜖 > 0,

$$ {\Delta} > \frac{4}{9} \log(2n/\epsilon) $$

for sufficiently large n. Then with probability at least 1 − 𝜖, for sufficiently large n,

$$ |\lambda(A) - \lambda(P)| \le 2\sqrt{\Delta \log(2n/\epsilon)}. $$

To make note of a somewhat subtle point: from an inferential perspective it is tempting to view the above result as a consistency result, where λ(P) is the population quantity or parameter of interest and λ(A) is its estimator. However, in the context of epidemic thresholds, we are interested in the random variable λ(A) itself, as we want to study the contagion spread conditional on a given social contact network. Therefore, in the present context, the above result should not be interpreted as a consistency result.

Rather, we can use the convergence result in a different way. For massive networks, the random variable λ(A), which we wish to compute but cannot feasibly do so, is close to the parameter λ(P). Suppose we can find a random variable T(A) which also converges in probability to λ(P) and is computationally efficient. Since T(A) and λ(A) both converge in probability to λ(P), we can use T(A) as an accurate proxy for λ(A). This would address the first of the two issues described at the beginning of this subsection. Furthermore, if T(A) can be computed from a small subset of the data, that would also solve the second issue. This is our central heuristic, which we are going to formalize next.

The Chung-Lu Model

So far, we have not made any structural assumptions on P; we have simply considered the generic inhomogeneous random graph model. Under such a general model, it is very difficult to formulate a statistic T(A) which is cheap to compute and converges to λ(P). Therefore, we now introduce a structural assumption on P, in the form of the well-known Chung-Lu model that was introduced by Aiello et al. (2000) and subsequently studied in many papers (Chung and Lu, 2002; Chung et al. 2003; Decreusefond et al. 2012; Pinar et al. 2012; Zhang et al. 2017). For a network with n nodes, let \(\delta = (\delta_{1}, \ldots, \delta_{n})\) be the vector of expected degrees. Then under the Chung-Lu model,

$$ P(i,j) = \frac{\delta_{i} \delta_{j}}{{\sum}_{k=1}^{n} \delta_{k}}. $$
(2)

This formulation preserves \(E[d_{i}] = \delta_{i}\), where \(d_{i}\) is the degree of the ith node, and is very flexible with respect to degree heterogeneity.

Under model Eq. 2, note that rank(P) = 1, and we have

$$ \begin{array}{@{}rcl@{}} &&P = \frac{1}{{{\sum}_{i=1}^{n} \delta_{i}}} \delta\delta^{\prime}\\ &\Rightarrow &P \delta = \frac{1}{{{\sum}_{i=1}^{n} \delta_{i}}} \delta\delta^{\prime}\delta = \frac{{\sum}_{i=1}^{n} {\delta_{i}^{2}}}{{{\sum}_{i=1}^{n} \delta_{i}}}\delta \\ &\Rightarrow &\lambda(P) = \frac{{\sum}_{i=1}^{n} {\delta_{i}^{2}}}{{{\sum}_{i=1}^{n} \delta_{i}}}. \end{array} $$
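The closed form for λ(P) can be checked numerically. The sketch below builds the rank-one matrix P from Eq. 2 for a set of hypothetical expected degrees (the range and the seed are arbitrary choices made for illustration) and compares its spectral radius with the ratio of the second to the first degree moment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical expected degrees; chosen so that every P(i, j) stays below 1.
delta = rng.uniform(1.0, 20.0, size=200)

# Chung-Lu edge probabilities, Eq. 2: P = delta delta' / sum(delta).
P = np.outer(delta, delta) / delta.sum()

# Spectral radius computed directly from P.
lam_P = np.max(np.abs(np.linalg.eigvalsh(P)))

# Closed form from the derivation: sum(delta^2) / sum(delta).
closed_form = np.sum(delta**2) / np.sum(delta)
```

Up to floating-point error, `lam_P` and `closed_form` agree, and `delta` itself is the corresponding eigenvector.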

Recall that we are looking for some computationally efficient T(A) which converges in probability to λ(P). We now know that under the Chung-Lu model, λ(P) is equal to the ratio of the second moment to the first moment of the degree distribution. Therefore, a simple estimator of λ(P) is given by the sample analogue of this ratio, i.e.,

$$ T_{1}(A) = \frac{{\sum}_{i=1}^{n} {d_{i}^{2}}}{{{\sum}_{i=1}^{n} d_{i}}}. $$
(3)

We now want to demonstrate that approximating λ(A) by T1(A) provides us with very substantial computational savings with little loss of accuracy. The approximation error can be quantified as

$$ e_{1}(A) = \left|\frac{T_{1}(A)}{\lambda(A)}-1\right|, $$
(4)

and our goal is to show that \(e_{1}(A) \rightarrow 0\) in probability, while the computational cost of T1(A) is much smaller than that of λ(A). We will show this both from a theoretical perspective and an empirical perspective. We next describe the empirical results from a simulation study, and we postpone the theoretical discussion to Section 3 for organizational clarity.

We used n = 5000, and constructed a Chung-Lu random graph model where \(P(i,j) = \theta_{i}\theta_{j}\). The model parameters \(\theta_{1}, \ldots, \theta_{n}\) determine the expected degrees. We used two models for generating \(\theta_{i}\). In the Uniform model, the \(\theta_{i}\) were sampled uniformly from (0, 0.25). In the PowerLaw model, the \(\theta_{i}\) were sampled from a power-law distribution with parameters \(x_{\min} = 0.01\) and β = 3. Note that the second model leads to a heavy-tailed distribution.

Then, we randomly generated 20 networks from the model, and computed λ(A) and T1(A). The results are reported in Table 2. We observe that the runtimes for T1(A) are orders of magnitude shorter than those for computing the eigenvalue. The average error for T1(A) is small, and so is the standard deviation (SD) of errors. Thus, even for moderately sized networks, using T1(A) as a proxy for λ(A) can reduce the computational cost to a great extent, without much loss in accuracy. For massive networks where n is in the millions, this advantage of T1(A) over λ(A) is even greater; however, the computational burden for λ(A) becomes so large that this case is difficult to illustrate using standard computing equipment.
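A single replicate of this kind of experiment can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Table 2: it uses the Uniform model only, a smaller network (n = 1000 instead of 5000) to keep the dense eigenvalue computation quick, and an arbitrary random seed; the helper name `t1` is hypothetical.

```python
import numpy as np

def t1(A):
    """Degree-moment estimator T1(A) = sum(d_i^2) / sum(d_i) (Eq. 3)."""
    d = A.sum(axis=1)
    return np.sum(d**2) / np.sum(d)

rng = np.random.default_rng(1)
n = 1000                                  # assumption: smaller than n = 5000
theta = rng.uniform(0.0, 0.25, size=n)    # Uniform model
P = np.outer(theta, theta)                # P(i, j) = theta_i * theta_j

# Sample an undirected graph without self-loops: draw the strict upper
# triangle and mirror it.
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)

lam_A = np.max(np.abs(np.linalg.eigvalsh(A)))   # spectral radius of A
e1 = abs(t1(A) / lam_A - 1.0)                   # approximation error, Eq. 4
```

Computing `t1(A)` needs only the row sums of A, whereas `lam_A` requires a full eigenvalue decomposition; timing the two calls separately gives a rough sense of the runtime gap discussed above.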

Table 2 Computational efficiency and statistical accuracy of T1(A)

Thus, T1(A) provides us with a computationally efficient and statistically accurate method for finding the epidemic threshold.

Comparing the results from Uniform and PowerLaw, we observe that errors are higher for the PowerLaw model. A likely explanation for this is that since the distribution is heavy tailed, the moment based estimator is less accurate. This is particularly true for larger n, since the impact of extreme values can shift the estimator heavily.

Sampling Based Approximation

The first approximation, T1(A), provides us with a computationally efficient method for finding the epidemic threshold. This addresses the first issue pointed out at the beginning of Section 2.1. However, computing T1(A) requires data on the degree of all n nodes of the network. Therefore, this does not solve the second issue pointed out at the beginning of Section 2.1. We now propose a second alternative, T2, to address the second issue. The idea behind this approximation is based on the same heuristic that was laid out in Section 2.2. Since λ(P) is a function of degree moments, we can estimate these moments using observed node degrees. In defining T1(A), we used observed degrees of all n nodes in the network. However, we can also estimate the degree moments by considering a small sample of nodes, based on random walk sampling. The algorithm for computing T2 is given in Algorithm 1.

Algorithm 1 Random walk sampling for computing T2

Note that we only use (t + r) randomly sampled nodes for computing T2, which implies that we do not need to collect or store data on all n individuals. Therefore this method overcomes the second issue pointed out at the beginning of Section 2.1. The approximation error arising from this method can be defined as

$$ e_{2}(A) = \left|\frac{T_{2}(A)}{\lambda(A)}-1\right|, $$
(5)

and we want to show that \(e_{2}(A) \rightarrow 0\) in probability, while the data-collection cost of T2(A) is much less than that of T1(A). In the next section, we are going to formalize this.
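Algorithm 1 appears only as a figure, so its exact steps are not reproduced in the text. The following is a minimal sketch of one plausible reading, assuming that the walk takes t burn-in steps, then records the degrees of r visited nodes and returns their average. The function name, the adjacency-list input, and the `gap` parameter (steps between consecutive samples) are hypothetical choices made for this illustration.

```python
import random

def t2_random_walk(neighbors, start, burn_in, num_samples, gap=1, seed=None):
    """Estimate sum(d^2)/sum(d) by averaging degrees along a random walk.

    neighbors:   dict mapping each node to a list of its neighbours.
    burn_in:     steps taken before the first sample (t in the text).
    num_samples: number of sampled nodes (r in the text).
    gap:         steps between consecutive samples (assumed parameter).
    """
    rng = random.Random(seed)
    node = start
    for _ in range(burn_in):
        node = rng.choice(neighbors[node])
    total = 0
    for _ in range(num_samples):
        for _ in range(gap):
            node = rng.choice(neighbors[node])
        total += len(neighbors[node])   # only the degree is recorded
    return total / num_samples

# Hypothetical test graph: Erdos-Renyi with 200 nodes and edge probability 0.1.
gen = random.Random(0)
n = 200
neighbors = {v: [] for v in range(n)}
for u in range(n):
    for v in range(u + 1, n):
        if gen.random() < 0.1:
            neighbors[u].append(v)
            neighbors[v].append(u)

degrees = [len(neighbors[v]) for v in range(n)]
target = sum(d * d for d in degrees) / sum(degrees)   # this is T1(A)
estimate = t2_random_walk(neighbors, start=0, burn_in=50,
                          num_samples=5000, gap=3, seed=1)
```

The walk only ever asks for the neighbour list of the node it currently occupies, so it makes `burn_in + gap * num_samples` node queries regardless of n. The example graph is small, so the walk revisits nodes many times; the point is the access pattern, not the sample fraction.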

Theoretical Results on Approximation Errors

In this section, we are going to establish that the approximation errors e1(A) and e2(A), defined in Eqs. 4 and 5, converge to zero in probability. From Theorem 2.1 of Chung et al. (2003), we know that when

$$ \frac{{\sum}_{i} {\delta_{i}^{2}}}{{\sum}_{i} \delta_{i}} > \log(n) \sqrt{\underset{1 \le i \le n}{\max} \delta_{i}} $$
(6)

holds, then for any 𝜖 > 0,

$$ P\left[\left|\frac{\lambda(A)}{\lambda(P)}-1\right| > \epsilon\right] \rightarrow 0. $$

Therefore, under Eq. 6, it suffices to show that, for any 𝜖 > 0,

$$ P\left[\left|\frac{T_{1}(A)}{\lambda(P)}-1\right| > \epsilon\right] \rightarrow 0, \text{ and } P\left[\left|\frac{T_{2}(A)}{\lambda(P)}-1\right| > \epsilon\right] \rightarrow 0. $$

To interpret the condition given in Eq. 6, suppose that the expected degrees are all of the same order, i.e., \(\delta_{i} = O(n^{\alpha})\) for some α ∈ (0,1). Then, the left hand side of Eq. 6 is \(O(n^{\alpha})\), and the right hand side is \(\log (n) O(n^{\alpha /2})\), which means the condition is satisfied for any α > 0 when n is sufficiently large.
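The condition in Eq. 6 is asymptotic, so for a fixed n it may or may not hold depending on α. The sketch below evaluates both sides of Eq. 6 for a given vector of expected degrees; the function name and the two example degree sequences are illustrative assumptions.

```python
import math

def satisfies_condition_6(delta):
    """Check Eq. 6: sum(delta^2)/sum(delta) > log(n) * sqrt(max(delta))."""
    n = len(delta)
    lhs = sum(x * x for x in delta) / sum(delta)
    rhs = math.log(n) * math.sqrt(max(delta))
    return lhs > rhs

n = 10**5
# All expected degrees equal to n^alpha, for two values of alpha.
dense = [n**0.5] * n    # alpha = 0.5: lhs is about 316, rhs about 205
sparse = [n**0.2] * n   # alpha = 0.2: lhs is 10, rhs about 36
```

Here the condition holds for the first sequence but not for the second, which illustrates that for small α the network must be very large before \(n^{\alpha/2}\) overtakes \(\log(n)\).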

Convergence of T1(A)

First, consider \(T_{1}(A) = \frac {{\sum }_{i=1}^{n} {d_{i}^{2}}}{{{\sum }_{i=1}^{n} d_{i}}}\), and recall that \(\lambda (P) = \frac {{\sum }_{i=1}^{n} {\delta _{i}^{2}}}{{{\sum }_{i=1}^{n} \delta _{i}}}\). For notational convenience, define \(m_{1} = {\sum }_{i=1}^{n} d_{i}, m_{2} = {\sum }_{i=1}^{n} {d_{i}^{2}}, \mu _{1} = {\sum }_{i=1}^{n} \delta _{i}, \mu _{2} = {\sum }_{i=1}^{n} {\delta _{i}^{2}}\). We would like to show that, under reasonable conditions, for any 𝜖 > 0,

$$ P\left[\left| \frac{m_{2} \mu_{1}} {m_{1} \mu_{2}} -1\right| > \epsilon\right] \rightarrow 0. $$
(7)

Next, we state the theorem which will establish a sufficient condition for this to hold. Please see Appendix for a proof of the theorem.

Theorem 2.

If the average of the expected degrees goes to infinity, i.e., \( \frac {1}{n}{{\sum }_{i} \delta _{i}} \rightarrow \infty \), and the spectral radius dominates \(\log ^{2}(n)\), i.e., \(\frac {{\sum }_{i} {\delta _{i}^{2}}}{{\sum }_{i} \delta _{i}} = \omega (\log ^{2} n)\), then for any 𝜖 > 0,

$$ P\left[\left|\frac{m_{1}}{\mu_{1}} - 1\right| > \epsilon\right] \rightarrow 0, \text{ and } P\left[\left|\frac{m_{2}}{\mu_{2}} - 1\right| > \epsilon\right] \rightarrow 0. $$

Since \(T_{1}(A)/\lambda(P) = (m_{2}/\mu_{2})/(m_{1}/\mu_{1})\), the two statements in Theorem 2 together imply Eq. 7. Thus, we have established that the approximation error for T1(A) goes to zero in probability. We have already observed in Section 2.2 that the runtime for T1(A) is orders of magnitude shorter than the runtime for λ(A). Therefore, T1(A) is both computationally efficient and statistically accurate as an approximation of the epidemic threshold.

Convergence of T2(A)

Next, consider Algorithm 1. Let π denote the stationary distribution of the simple random walk on the given graph. Suppose the number of edges in the given graph is m. Recall that π is given by \(\pi _{v} = \frac {d_{v}}{{\sum }_{u} d_{u}}\) for all v. For brevity, we define the mixing time of the graph A, denoted as tmix(A), to mean the number of steps required by the simple random walk to reach a distribution \(\hat {\pi }\) such that \(\|\hat {\pi } - \pi \|_{1} = o(\frac {1}{n^{2}})\). Let T2(A) be the estimate returned by Algorithm 1. We first show an easy lemma that characterizes the bias of the estimator T2(A). Please see Appendix for a proof.

Lemma 3.

If x is a node that is randomly sampled from π, and dx is its degree, then \(E[d_{x}]= \frac {{\sum }_{i} {d_{i}^{2}}}{{\sum }_{i} d_{i}}. \) Consequently if \(\hat {\pi }\) is such that \(\|\pi - \hat {\pi }\|_{1} = o(n^{-1})\) and x is sampled from \(\hat {\pi }\), then \(E[d_{x}]= (1 \pm o(1))\frac {{\sum }_{i} {d_{i}^{2}}}{{\sum }_{i} d_{i}}\).
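The argument behind Lemma 3 is short. The following lines are a sketch of the reasoning, included for convenience and not a replacement for the proof in the Appendix. The first claim follows from the definition of π:

$$ E_{\pi}[d_{x}] = \sum\limits_{v} \pi_{v} d_{v} = \sum\limits_{v} \frac{d_{v}}{{\sum}_{u} d_{u}} d_{v} = \frac{{\sum}_{v} {d_{v}^{2}}}{{\sum}_{u} d_{u}}. $$

For the second claim, every degree is at most n − 1, so

$$ \left| E_{\hat{\pi}}[d_{x}] - E_{\pi}[d_{x}] \right| \le \sum\limits_{v} |\hat{\pi}_{v} - \pi_{v}| d_{v} \le (n-1) \|\hat{\pi} - \pi\|_{1} = o(1). $$

Since \({d_{v}^{2}} \ge d_{v}\) for every integer degree, \(E_{\pi}[d_{x}] \ge 1\) whenever the graph has at least one edge, so an additive error of o(1) is also a relative error of o(1).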

Next, we show that the estimator vRW is actually concentrated around its expectation.

Theorem 4 (Lezaud (1998)).

Let \((X_{n})\) be an irreducible and reversible Markov chain on a finite set V with transition matrix Q. Let π be the stationary distribution. Let \(f: V\rightarrow \Re \) be such that \(E_{\pi}[f] = 0\), \(\|f\|_{\infty } \leq 1\) and \(0 < E_{\pi}[f^{2}] \leq b^{2}\). Then, for any initial distribution q, any positive integer r and all 0 < γ ≤ 1,

$$ \begin{array}{@{}rcl@{}} \Pr_{q} \left[ r^{-1}\sum\limits_{i=1}^{r} f(X_{i})\ge \gamma \right] \le e^{-\varepsilon(Q)/5} S_{q} \exp\left( -\frac{r\gamma^{2}\varepsilon(Q)}{4b^{2}(1 + h(5\gamma/b^{2}))} \right), \end{array} $$

where \(\varepsilon(Q) = 1 - \lambda_{2}(Q)\), with \(\lambda_{2}(Q)\) being the second largest eigenvalue of Q, \(S_{q} = \|q/\pi\|_{2}\) (in the \(\ell^{2}(\pi)\) norm), and

$$ h(x) = \frac{1}{2}(\sqrt{1+x} - (1 - x/2)). $$

If \(\gamma \ll b^{2}\) and \(\varepsilon(Q) \ll 1\), then the upper bound becomes

$$ (1 +o(1))S_{q} \exp\left( - \frac{r\gamma^{2}\varepsilon(Q)}{4b^{2}(1+o(1))} \right). $$

Using the above result, we bound the sample complexity of our estimator. We first quote the following result that we use to bound λ1 of the transition matrix. Please see Appendix for a proof.

Theorem 5.

Let \(Q = D^{-1}A\). Let 𝜖, δ ∈ (0,1). Algorithm 1, using \(r = \frac {1}{\varepsilon (Q) \epsilon ^{3/2}} \times \frac {12m d_{\max}}{({\sum }_{v} {d_{v}^{2}})} \log (1/\delta )\) and \(t \ge t_{mix}(G)\), returns an estimate vRW that satisfies, with probability 1 − δ,

$$ (1 - \epsilon)\frac{{\sum}_{v} {{d}_{v}^{2}}}{{\sum}_{v} d_{v}}\le T_{2}(A) \le (1 + \epsilon)\frac{{\sum}_{v} {{d}_{v}^{2}}}{{\sum}_{v} d_{v}}. $$

The number of nodes that are touched by the algorithm is O(t + r).

Note that \(Q = D^{-1}A\) has the same set of eigenvalues as the matrix \(D^{-1/2}AD^{-1/2}\). For the Chung-Lu model, the eigenvalues of the matrix \(L = I - D^{-1/2}AD^{-1/2}\) can be bounded by the following result from Chung et al. (2003).

Theorem 6.

Let \(L = I - D^{-1/2}AD^{-1/2}\) denote the normalized Laplacian. Let A be a random graph generated from the given expected degrees model, with expected degrees \(\{\delta_{i}\}\). If the minimum expected degree \(\delta _{\min}\) satisfies \(\delta _{\min} \gg \ln (n)\), then with probability at least 1 − 1/n = 1 − o(1), we have that for all eigenvalues \(\lambda _{k}(L) > \lambda _{\min}(L)\) of the Laplacian of G,

$$\left| 1 - \lambda_{k}(L)\right| < 2 \sqrt{\frac{6\ln(2n)}{\delta_{\min}}} = o(1). $$

It follows from the above that \(\varepsilon(Q) = 1 - \lambda_{2}(Q) = 1 - \lambda_{2}(D^{-1/2}AD^{-1/2}) = \lambda_{n-1}(I - D^{-1/2}AD^{-1/2}) = 1 - o(1)\). Putting these together, we get the following corollary on the total number of node queries.

Corollary 6.1.

For a graph generated from the expected degrees model, with probability 1 − 1/n, Algorithm 1 needs to query

$$ \ln(n) + \frac{1}{\epsilon^{3/2}} \times \frac{6({\sum}_{v} d_{v}) d_{\max}}{({\sum}_{v} {d_{v}^{2}})} \log(1/\delta) $$

nodes in order to get a (1 ± 𝜖) estimate of \({\sum }_{v} {d_{v}^{2}}/2m\).

Note that \(\frac {6({\sum }_{v} d_{v}) d_{\max}}{({\sum }_{v} {d_{v}^{2}})} \le \frac {6d_{\max}}{d_{\min}}\), but this is a loose bound. Better bounds can be derived for power law degree distributions, for instance.
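To get a feel for the size of the query budget in Corollary 6.1, the expression can be evaluated for a given degree sequence. The sketch below plugs numbers into the stated formula, taking its constants at face value; the function name, argument names, and the regular example graph are illustrative assumptions.

```python
import math

def query_budget(degrees, eps, fail_prob):
    """Evaluate the node-query expression of Corollary 6.1.

    degrees:   observed degrees d_v of the graph.
    eps:       target relative accuracy (epsilon in the text).
    fail_prob: allowed failure probability (delta in the text).
    """
    n = len(degrees)
    s1 = sum(degrees)
    s2 = sum(d * d for d in degrees)
    burn_in = math.log(n)
    samples = (6 * s1 * max(degrees) / s2) * math.log(1 / fail_prob) / eps**1.5
    return burn_in + samples

# Hypothetical example: a regular graph with 10**6 nodes, each of degree 50.
# For a regular graph the degree factor reduces to exactly 6.
budget = query_budget([50] * 10**6, eps=0.1, fail_prob=0.05)
# about 582 queries, compared with n = 10**6 nodes
```

For degree sequences with a few very large hubs, the factor involving the maximum degree can be much larger than 6, which is where the sharper bounds mentioned above would matter.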

Thus, we have proved that the approximation error for T2(A) goes to zero in probability. In addition, Corollary 6.1 shows that the number of nodes that we need to query in order to have an accurate approximation is much smaller than n. Furthermore, computing T2 only requires node sampling and counting degrees, and therefore the runtime is much smaller than eigenvalue algorithms. Therefore, T2(A) is a computationally efficient and statistically accurate approximation of the epidemic threshold, while also requiring a much smaller data budget compared to T1(A).

Numerical Results

In this section, we characterize the empirical performance of our sampling algorithm on two synthetic networks, one generated from the Chung-Lu model and the second generated from the preferential attachment model of Barabási and Albert (1999).

Data

Our first dataset is a graph generated from the Chung-Lu model of expected degrees. We generated a power-law sequence (i.e., the fraction of nodes with degree d is proportional to \(d^{-\beta}\)) with exponent β = 2.5 and then generated a graph with this sequence as the expected degrees. Table 3 notes that, as expected, the first eigenvalue λ1(A) is close to \(\frac {{\sum }_{v} {d_{v}^{2}}}{{\sum }_{v} d_{v}}\).

Table 3 Statistics of the two synthetic datasets used

The second dataset is generated from the preferential attachment model (Barabási and Albert, 1999), where each incoming node adds 5 edges to the existing nodes, the probability of choosing a specific node as neighbor being proportional to the current degree of that node. While the preferential attachment model naturally gives rise to a directed graph, we convert the graph to an undirected one before running our algorithm. It is interesting to note that even though the Chung-Lu model does not hold in this case, our first approximation, T1(A), is close to λ(A).

Implementation Details

In each of the networks, the random walk algorithm presented in Algorithm 1 was used for sampling. The random walk was started from an arbitrary node and every 10th node was sampled (to account for the mixing time) from the walk. These samples were then used to calculate T2(A). This experiment was repeated 10 times. These gave estimates \({T_{2}^{1}},\ldots ,T_{2}^{10}\). We then calculate two relative errors ∀i ∈{1,2,…,10},

$$ \begin{array}{@{}rcl@{}} \epsilon^{T_{1}-T_{2}}_{i} = \frac{\left|{T_{2}^{i}} - T_{1}(A)\right|}{T_{1}(A)}, \ \epsilon^{\lambda-T_{2}}_{i} = \frac{\left|{T_{2}^{i}} - \lambda(A)\right|}{\lambda(A)}. \end{array} $$

We also note the following relation between the two error metrics, where \(\epsilon^{\lambda-T_{1}} = |\lambda - T_{1}|/\lambda\) denotes the relative error of T1(A) with respect to λ(A).

$$ \begin{array}{@{}rcl@{}} \epsilon_{i}^{\lambda-T_{2}} & = & \frac{|\lambda - {T_{2}^{i}}|}{\lambda} \le \frac{|T_{1} - {T_{2}^{i}}|}{\lambda} + \frac{|\lambda - T_{1}|}{\lambda} = \frac{|T_{1} - {T_{2}^{i}}|}{\lambda} + \epsilon^{\lambda-T_{1}} = \frac{T_{1}}{\lambda} \epsilon_{i}^{T_{1} - T_{2}} + \epsilon^{\lambda-T_{1}}\\ & \le & (1 + \epsilon^{\lambda-T_{1}}) \epsilon_{i}^{T_{1} - T_{2}} + \epsilon^{\lambda-T_{1}}. \end{array} $$

We denote the averages of \(\{\epsilon _{i}^{T_{1}-T_{2}}\}\) and \(\{\epsilon _{i}^{\lambda -T_{2}}\}\) as \(\epsilon ^{T_{1}-T_{2}}\) and \(\epsilon ^{\lambda -T_{2}}\) respectively. It is easy to observe that the above relation holds between the two average quantities too.
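The two averaged error metrics, and the bound relating them, can be computed as in the sketch below. The numerical values are made up purely for illustration and are not results from the paper; the function and variable names are likewise hypothetical.

```python
def relative_errors(t2_runs, t1, lam):
    """Average relative errors of repeated T2 estimates w.r.t. T1 and lambda."""
    err_t1 = sum(abs(t - t1) / t1 for t in t2_runs) / len(t2_runs)
    err_lam = sum(abs(t - lam) / lam for t in t2_runs) / len(t2_runs)
    return err_t1, err_lam

# Hypothetical values, for illustration only.
lam, t1 = 40.0, 41.0
t2_runs = [39.0, 42.5, 40.5, 43.0, 38.5]

err_t1, err_lam = relative_errors(t2_runs, t1, lam)

# Relative error of T1 with respect to lambda, and the resulting upper
# bound on the average error of T2 with respect to lambda.
err_lam_t1 = abs(lam - t1) / lam
bound = (1 + err_lam_t1) * err_t1 + err_lam_t1
```

The bound is useful when λ(A) is too expensive to compute for every run: an error measured against T1(A) can be converted into a guarantee against λ(A) once the error of T1(A) itself is known.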

We plot the averages \(\epsilon ^{T_{1}-T_{2}}\) and \(\epsilon ^{\lambda -T_{2}}\), along with the error-bars that reflect the standard deviation, against the actual number of nodes seen by the random walk. Note that the x-axis accurately reflects how many times the algorithm actually queried the network, not just the number of samples used. Measuring the cost of uniform node sampling in this setting, for instance, would require keeping track of how many nodes are touched by a Metropolis-Hastings walk that implements the uniform distribution.

Results

In Fig. 1, we plot the two results for mean relative error, measured by \(\epsilon _{i}^{\lambda -T_{2}}\) and \(\epsilon _{i}^{T_{1}-T_{2}}\).

Figure 1 Results on three synthetic networks

For the two Chung-Lu networks, the algorithm is able to get a 10% approximation to the statistic T1(A) by exploring at most 10% of the network. With more samples from the random walk, the mean relative errors settle to around 4–5%. However, once we measure the mean relative errors with respect to λ(A), it becomes clearer that the estimator T2(A) does better when the graph is closer to the assumed (i.e. Chung-Lu) model. For the Chung-Lu graph, the mean error \(\epsilon ^{\lambda -T_{2}}\) is very similar to \(\epsilon ^{T_{1}-T_{2}}\), which is to be expected. For the preferential attachment graph too, it is clear that the estimate T2 is able to achieve a better than 10% relative error approximation of λ(A).

Note that, if we were instead counting only the nodes whose degrees were actually used for estimation, the fraction of the network used would be roughly 1–2% in all the cases. The majority of the node query cost actually goes into making the random walk mix, by using an initial burn-in period and by maintaining a certain number of steps between subsequent samples.

Discussion

In this work, we investigated the problem of computing SIR epidemic thresholds of social contact networks from the perspective of statistical inference. We considered the two challenges that arise in this context, due to high computational and data-collection complexity of the spectral radius. For the Chung-Lu network generative model, the spectral radius can be characterized in terms of the degree moments. We utilized this fact to develop two approximations of the spectral radius. The first approximation is computationally efficient and statistically accurate, but requires data on observed degrees of all nodes. The second approximation retains the computational efficiency and statistical accuracy of the first approximation, while also reducing the number of queries or the sample size quite substantially. The results seem very promising for networks arising from the Chung-Lu and preferential attachment generative models.
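As a small illustration of the two quantities being compared, the sketch below computes the degree-moment ratio Σd²/Σd (the form of T1(A) used in the appendix) and the spectral radius λ(A) by power iteration. The function names are hypothetical, and power iteration on A + I is a generic numerical choice made here for simplicity, not a method taken from the paper.

```python
import math

def degree_moment_ratio(adj):
    """Ratio of the second to the first degree moment, sum(d^2) / sum(d)."""
    degs = [len(nbrs) for nbrs in adj.values()]
    return sum(d * d for d in degs) / sum(degs)

def spectral_radius(adj, iters=200):
    """Largest adjacency eigenvalue of a connected undirected graph.

    Power iteration is run on A + I; the shift avoids oscillation on bipartite graphs.
    """
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        y = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # Rayleigh quotient x^T A x for the unit-norm vector x.
    return sum(x[v] * sum(x[u] for u in adj[v]) for v in nodes)

# Toy examples (hypothetical): a complete graph and a star.
complete = {i: [j for j in range(5) if j != i] for i in range(5)}
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(degree_moment_ratio(complete), spectral_radius(complete))
print(degree_moment_ratio(star), spectral_radius(star))
```

On the regular graph the two quantities coincide, while on the star they differ, which is consistent with the degree-based approximation being tied to a model assumption rather than holding for every graph.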

There are several interesting and important future directions. The methods proposed in this paper have provable guarantees only under the Chung-Lu model, although they work very well under the preferential attachment model. This seems to indicate that the degree-based approximation might be applicable to a wider class of models. On the other hand, this leaves open the question of developing a better “model-free” estimator, as well as asking similar questions about other network features.

In this work we only considered the problem of accurate approximation of the epidemic threshold. From a statistical as well as a real-world perspective, there are several related inference questions. These include uncertainty quantification, confidence intervals, one-sample and two-sample testing, etc.

Social interaction patterns vary dynamically over time, and such network dynamics can have significant impacts on the contagion process (Leitch et al. 2019). In this paper we only considered static social contact networks, and in the future we hope to study epidemic thresholds for time-varying or dynamic networks.

Finally, we note that the formulation in Eq. 1 is an approximation of the true epidemic threshold under the so-called quenched-mean-field approximation (Pastor-Satorras et al. 2015; Karrer et al. 2014). Recent work (Castellano and Pastor-Satorras 2020) has shown that the SIS epidemic transition occurs at a point intermediate between λ(A) and T1(A). In future work, we plan to extend our results to these more accurate expressions for the epidemic threshold.

References

  • Aiello, W., Chung, F. and Lu, L. (2000). A random graph model for massive graphs, In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of computing. ACM, p. 171–180.

  • Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509–512.

  • Barrett, C.L., Bisset, K.R., Eubank, S.G., Feng, X. and Marathe, M.V. (2008). Episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks, In SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, p. 1–12.

  • Benaych-Georges, F., Bordenave, C., Knowles, A. et al. (2019). Largest eigenvalues of sparse inhomogeneous Erdős–Rényi graphs. Ann. Probab. 47, 1653–1676.

  • Bengtsson, L., Gaudart, J., Lu, X., Moore, S., Wetter, E., Sallah, K., Rebaudet, S. and Piarroux, R. (2015). Using mobile phone data to predict the spatial spread of cholera. Sci. Rep. 5, 8923.

  • Bhadra, S., Chakraborty, K., Sengupta, S. and Lahiri, S. (2019). A bootstrap-based inference framework for testing similarity of paired networks. arXiv:1911.06869.

  • Bickel, P.J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. 106, 21068–21073.

  • Bickel, P.J. and Sarkar, P. (2016). Hypothesis testing for automated community detection in networks. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 78, 253–273.

  • Bordenave, C., Benaych-Georges, F. and Knowles, A (2020). Spectral radii of sparse random matrices. Ann. l’Inst. Henri Poincare (B) Probab. Stat.

  • Brauer, F. and Castillo-Chavez, C. (2012). Mathematical models in population biology and epidemiology, vol. 2. Springer, Berlin.

  • Castellano, C. and Pastor-Satorras, R. (2020). Cumulative merging percolation and the epidemic transition of the susceptible-infected-susceptible model in networks. Phys. Rev. X 10, 011070.

  • Chakrabarti, D., Wang, Y., Wang, C., Leskovec, J. and Faloutsos, C. (2008). Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur. 10, 1–26.

  • Chinazzi, M., Davis, J.T., Ajelli, M., Gioannini, C., Litvinova, M., Merler, S., Piontti, A.P., Mu, K., Rossi, L., Sun, K. et al. (2020). The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science 368, 6489, 395–400.

  • Chung, F. and Lu, L. (2002). The average distances in random graphs with given expected degrees. Proc. Natl. Acad. Sci. 99, 15879–15882.

  • Chung, F. and Radcliffe, M. (2011). On the spectra of general random graphs. Electron. J. Combinator. 18, P215–P215.

  • Chung, F., Lu, L. and Vu, V. (2003). Eigenvalues of random power law graphs. Ann. Combinator. 7, 21–33.

  • Colizza, V. and Vespignani, A. (2007). Invasion threshold in heterogeneous metapopulation networks. Phys. Rev. Lett. 99, 148701.

  • Dallas, T.A., Krkošek, M. and Drake, J.M. (2018). Experimental evidence of a pathogen invasion threshold. R. Soc. Open Sci. 5, 171975.

  • Decreusefond, L., Dhersin, J.-S., Moyal, P., Tran, V.C. et al. (2012). Large graph limit for an SIR process in random network with heterogeneous connectivity. Ann. Appl. Probab. 22, 541–575.

  • Eubank, S., Guclu, H., Kumar, V.A., Marathe, M.V., Srinivasan, A., Toroczkai, Z. and Wang, N. (2004). Modelling disease outbreaks in realistic urban social networks. Nature 429, 180–184.

  • Galvani, A.P. and May, R.M. (2005). Dimensions of superspreading. Nature 438, 293–295.

  • Ghoshdastidar, D. and von Luxburg, U. (2018). Practical methods for graph two-sample testing, In Advances in Neural Information Processing Systems, p. 3019–3028.

  • Gómez, S., Arenas, A., Borge-Holthoefer, J., Meloni, S. and Moreno, Y. (2010). Discrete-time Markov chain approach to contact-based disease spreading in complex networks. EPL (Europhys. Lett.) 89, 38009.

  • Handcock, M.S., Raftery, A.E. and Tantrum, J.M. (2007). Model-based clustering for social networks. J. R. Stat. Soc.: Ser. A 170, 301–354.

  • Hethcote, H.W. (2000). The mathematics of infectious diseases. SIAM Rev. 42, 599–653.

  • Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables, In The Collected Works of Wassily Hoeffding. Springer, p. 409–426.

  • Hoff, P.D., Raftery, A.E. and Handcock, M.S. (2002). Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97, 1090–1098.

  • Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X. et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395, 497–506.

  • Karrer, B., Newman, M.E. and Zdeborová, L. (2014). Percolation on sparse networks. Phys. Rev. Lett. 113, 20, 208702.

  • Keeling, M. (2005). The implications of network structure for epidemic dynamics. Theor. Popul. Biol. 67, 1–8.

  • Kermack, W.O. and McKendrick, A.G. (1927). A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. Ser. A, Containing papers of a mathematical and physical character 115, 700–721.

  • Kermack, W.O. and McKendrick, A.G. (1932). Contributions to the mathematical theory of epidemics. ii.—the problem of endemicity. Proc. R. Soc. Lond. Ser. A, Containing papers of a mathematical and physical character 138, 55–83.

  • Kermack, W.O. and McKendrick, A.G. (1933). Contributions to the mathematical theory of epidemics. iii.—further studies of the problem of endemicity. Proc. R. Soc. Lond. Ser. A, Containing Papers of a Mathematical and Physical Character 141, 94–122.

  • Komolafe, T., Quevedo, A.V., Sengupta, S. and Woodall, W.H. (2019). Statistical evaluation of spectral methods for anomaly detection in static networks. Netw. Sci. 7, 319–352.

  • Kramer, A.M., Pulliam, J.T., Alexander, L.W., Park, A.W., Rohani, P. and Drake, J.M. (2016). Spatial spread of the West Africa Ebola epidemic. R. Soc. Open Sci. 3, 8, 160294.

  • Krivitsky, P.N., Handcock, M.S., Raftery, A.E. and Hoff, P.D. (2009). Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social Netw. 31, 204–213.

  • Leitch, J., Alexander, K.A. and Sengupta, S. (2019). Toward epidemic thresholds on temporal networks: a review and open questions. Appl. Netw. Sci. 4, 105.

  • Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab. 8, 3, 849–867.

  • Meyers, L.A., Pourbohloul, B., Newman, M., Skowronski, D.M. and Brunham, R.C. (2005). Network theory and SARS: predicting outbreak diversity. J. Theor. Biol. 232, 71–81.

  • Newman, M.E.J. (2002). Spread of epidemic disease on networks. Phys. Rev. E 66, 1, 016128.

  • Pastor-Satorras, R., Castellano, C., Van Mieghem, P. and Vespignani, A. (2015). Epidemic processes in complex networks. Rev. Mod. Phys. 87, 925–979.

  • Pinar, A., Seshadhri, C. and Kolda, T.G. (2012). The similarity between stochastic Kronecker and Chung-Lu graph models, In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, p. 1071–1082.

  • Pourbohloul, B., Meyers, L., Skowronski, D., Krajden, M., Patrick, D. and Brunham, R. (2005). Modeling control strategies of respiratory pathogens. Emerg. Infect. Dis. 11, 1249–56.

  • Prakash, B.A., Chakrabarti, D., Faloutsos, M., Valler, N. and Faloutsos, C. (2010). Got the flu (or mumps)? Check the Eigenvalue! arXiv:1004.0060.

  • Rocha, L.E.C., Liljeros, F. and Holme, P. (2011). Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLoS Comput. Biol. 7, e1001109.

  • Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat. 39, 1878–1915.

  • Sengupta, S. (2018). Anomaly detection in static networks using egonets. arXiv:1807.08925.

  • Sengupta, S. and Chen, Y. (2015). Spectral clustering in heterogeneous networks. Stat. Sin. 25, 1081–1106.

  • Sengupta, S. and Chen, Y. (2018). A block model for node popularity in networks with community structure. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 80, 365–386.

  • Shulgin, B., Stone, L. and Agur, Z. (1998). Pulse vaccination strategy in the SIR epidemic model. Bull. Math. Biol. 60, 1123–1148.

  • Sun, K., Chen, J. and Viboud, C. (2020). Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. Lancet Digit. Health 2, 4, e201–e208.

  • Tang, M., Athreya, A., Sussman, D.L., Lyzinski, V., Park, Y. and Priebe, C.E. (2017a). A semiparametric two-sample hypothesis testing problem for random graphs. J. Comput. Graph. Stat. 26, 344–354.

  • Tang, M., Athreya, A., Sussman, D.L., Lyzinski, V. and Priebe, C.E. (2017b). A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli 23, 1599–1630.

  • Van den Driessche, P. and Watmough, J. (2002). Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission. Math. Biosci. 180, 29–48.

  • Wallinga, J., Heijne, J.C. and Kretzschmar, M. (2005). A measles epidemic threshold in a highly vaccinated population. PLoS Med. 2, e316.

  • Wang, Y.R. and Bickel, P.J. (2017). Likelihood-based model selection for stochastic block models. Ann. Stat. 45, 500–528.

  • Wang, Y., Chakrabarti, D., Wang, C. and Faloutsos, C. (2003). Epidemic spreading in real networks: an eigenvalue viewpoint, In 22nd International Symposium on Reliable Distributed Systems, 2003. Proceedings. IEEE Computer Society, Florence, p. 25–34.

  • Wang, W., Liu, Q.H., Zhong, L.F. et al. (2016). Predicting the epidemic threshold of the susceptible-infected-recovered model. Sci. Rep. 6, 24676. https://doi.org/10.1038/srep24676.

  • Wang, W., Tang, M., Stanley, H.E. and Braunstein, L.A. (2017). Unification of theoretical approaches for epidemic spreading on complex networks. Rep. Progr. Phys. 80, 036603.

  • Wang, C., Horby, P.W., Hayden, F.G. and Gao, G.F. (2020). A novel coronavirus outbreak of global health concern. Lancet 395, 470–473.

  • Woolhouse, M.E.J., Dye, C., Etard, J.F., Smith, T., Charlwood, J.D., Garnett, G.P., Hagan, P., Hii, J.L.K., Ndhlovu, P.D., Quinnell, R.J., Watts, C.H., Chandiwana, S.K. and Anderson, R.M. (1997). Heterogeneities in the transmission of infectious agents: implications for the design of control programs. Proc. Natl. Acad. Sci. 94, 338–342.

  • Yan, X., Shalizi, C., Jensen, J.E., Krzakala, F., Moore, C., Zdeborová, L., Zhang, P. and Zhu, Y. (2014). Model selection for degree-corrected block models. J. Stat. Mech.: Theory Exp. 2014, P05007.

  • Zhang, X., Moore, C. and Newman, M.E. (2017). Random graph models for dynamic networks. Eur. Phys. J. B 90, 200.

  • Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Stat. 40, 2266–2292.

  • Zhao, M.J., Driscoll, A.R., Sengupta, S., Fricker, Jr. R. D., Spitzner, D.J. and Woodall, W.H. (2018). Performance evaluation of social network anomaly detection using a moving window–based scan method. Qual. Reliab. Eng. Int. 34, 1699–1716.

  • Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., Zhao, X., Huang, B., Shi, W., Lu, R. et al. (2020). A novel coronavirus from patients with pneumonia in China. New Engl. J. Med., 2019.

Acknowledgements

We thank the Associate Editor and two anonymous reviewers for their constructive suggestions, which were really helpful towards the improvement of the manuscript. Anirban acknowledges the kind support of the N. Rama Rao Chair Professorship at IIT Gandhinagar, the Google India AI/ML award (2020), Google Faculty Award (2015), and CISCO University Research Grant (2016). Srijan acknowledges the support from an NIH R01 grant 1R01LM013309.

Author information

Corresponding author

Correspondence to Anirban Dasgupta.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Disclaimer with Respect to Current Pandemic

We do realize that in the face of the current pandemic, while it is important to pursue research relevant to it, it is also important to be responsible in following the proper scientific process. We would like to state that in this work, the question of epidemic threshold estimation has been formalized from a theoretical viewpoint in a much used, but simple, random graph model. We are not yet in a position to give any guarantees about the performance of our estimator in real social networks. We do hope, however, that the techniques developed here can be further refined to give reliable estimators in practical settings.

Anirban Dasgupta’s work is partially supported by grants from DBT India, Google and CISCO.

Srijan Sengupta’s work is partially supported by an NIH R01 grant 1R01LM013309.

Appendix A. Technical Proofs

A.1 Proof of Theorem 2

We will show that for any \(\epsilon ^{\prime } > 0\),

$$ P\left[\left|\frac{m_{1}}{\mu_{1}} - 1\right| > \epsilon^{\prime}\right] \rightarrow 0, \quad P\left[\left|\frac{m_{2}}{\mu_{2}} - 1\right| > \epsilon^{\prime}\right] \rightarrow 0. \tag{8} $$

We first prove that Eq. 8 implies Eq. 7. Equation 8 implies that

$$ P\left[\left\{\left|\frac{m_{1}}{\mu_{1}} - 1 \right| > \epsilon^{\prime}\right\} \cup \left\{\left|\frac{m_{2}}{\mu_{2}} - 1\right| > \epsilon^{\prime}\right\}\right] \rightarrow 0. $$

Now, consider the event \(\left \{\left |\frac {m_{1}}{\mu _{1}} - 1 \right | \le \epsilon ^{\prime }\right \} \cap \left \{\left |\frac {m_{2}}{\mu _{2}} - 1\right | \le \epsilon ^{\prime }\right \}\). Note that m2/m1 is a strictly increasing function of m2 and a strictly decreasing function of m1. Therefore, for outcomes belonging to the above event,

$$ \frac{\mu_{2}}{\mu_{1}} \times \frac{1-\epsilon^{\prime}}{1+\epsilon^{\prime}} \le \frac{m_{2}}{m_{1}} \le \frac{\mu_{2}}{\mu_{1}} \times \frac{1+\epsilon^{\prime}}{1-\epsilon^{\prime}}. $$

Note that

$$ 1 - \frac{1-\epsilon^{\prime}}{1+\epsilon^{\prime}} = \frac{2\epsilon^{\prime}}{1+\epsilon^{\prime}} < 2\epsilon^{\prime}, \text{ and } \frac{1+\epsilon^{\prime}}{1-\epsilon^{\prime}}-1 = \frac{2\epsilon^{\prime}}{1-\epsilon^{\prime}} < 4\epsilon^{\prime}, $$

given that \(\epsilon ^{\prime } < 1/2\). Now, fix 𝜖 > 0 and let \(\epsilon ^{\prime } = \epsilon /4\). Then,

$$ \text{Eq. 8} \Rightarrow P\left[\left| \frac{m_{2} \mu_{1}} {m_{1} \mu_{2}} -1\right| > 4\epsilon^{\prime}\right] \rightarrow 0 \Rightarrow \text{Eq. 7}. $$

Thus, proving Eq. 8 is sufficient for proving Eq. 7.

Proof 1 (Proof of Theorem 2).

We will use Hoeffding’s inequality (Hoeffding, 1994) for the first part, and we begin by stating the inequality for the sum of Bernoulli random variables. Let B1,…, Bm be m independent (but not necessarily identically distributed) Bernoulli random variables, and \(S_{m} = {\sum }_{i=1}^{m} B_{i}\). Then for any t > 0,

$$ P[|S_{m} - {E}[S_{m}]| \geq t] \le 2 \exp\left( {\frac{-2t^{2}}{m}}\right). $$

In our case,

$$ m_{1} = \sum\limits_{i=1}^{n} d_{i} = \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} A(i,j) = 2 \sum\limits_{i<j} A(i,j), $$

and we know that {A(i, j) : 1 ≤ i < j ≤ n} are independent Bernoulli random variables. Fix 𝜖 > 0 and note that \(E[{\sum }_{i<j} A(i,j)] = \frac {1}{2}\mu _{1}\). Using Hoeffding’s inequality with Sm = m1/2, \(m = {n \choose 2}\), and \(t = \frac {\epsilon }{2} \mu _{1}\), we get

$$ P\left[\left|\frac{m_{1}}{2} - \frac{\mu_{1}}{2}\right| > \frac{\epsilon}{2} \mu_{1}\right] \le 2 \exp\left( -\epsilon^{2} {\frac{{\mu_{1}^{2}}}{n(n-1)}}\right). $$

Since \( \frac {1}{n}{{\sum }_{i} \delta _{i}} \rightarrow \infty \), the right hand side goes to zero. Therefore,

$$ P\left[\left|\frac{m_{1}}{\mu_{1}} - 1\right| > \epsilon\right] \rightarrow 0. $$
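To see how the bound above behaves numerically, the following sketch evaluates its right-hand side for a few average degrees. It assumes, as the proof suggests, that μ1 is the sum of the expected degrees; the function name and the parameter values are hypothetical.

```python
import math

def hoeffding_bound_m1(mu1, n, eps):
    """Right-hand side 2*exp(-eps^2 * mu1^2 / (n(n-1))), capped at 1."""
    exponent = eps ** 2 * mu1 ** 2 / (n * (n - 1))
    return min(1.0, 2.0 * math.exp(-exponent))

n = 1000
for avg_degree in (5, 20, 100):
    mu1 = avg_degree * n   # assumption: mu1 is the sum of expected degrees
    print(avg_degree, hoeffding_bound_m1(mu1, n, eps=0.1))
```

For a small average degree the bound is vacuous, and it becomes meaningful only as the average expected degree grows, which is the role of the growth condition used in the proof.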

For the second part, we can characterize m2 as follows.

$$ E[m_{2}] = E\left[\sum\limits_{i} {d_{i}^{2}}\right] = \sum\limits_{i} \left( (E[d_{i}])^{2} + var(d_{i}) \right) = \mu_{2} + \sum\limits_{i} var(d_{i}), $$

and hence,

$$ |m_{2} - \mu_{2} | \le |m_{2} - E[m_{2}]| + |E[m_{2}] - \mu_{2}|. $$

We show that, under the given assumptions, with probability 1 − o(1), |m2 − E[m2]| = o(μ2). Furthermore, |E[m2] − μ2| = o(μ2).

As noted before, each di is a sum of independent Bernoulli random variables. By applying the Chernoff-Hoeffding bound, and taking a union bound over all i ∈{1,…, n}, we get that, with probability 1 − o(1), and for any fixed 𝜖 ∈ (0,1),

$$ \forall i \in \{1, \ldots, n\},\ d_{i} \le \delta_{i} + \max\{\epsilon \delta_{i}, O(\log(n))\}. $$

Let the above event be called the event \(\mathcal {A}\). If the event \(\mathcal {A}\) happens, then,

$$ \begin{array}{@{}rcl@{}} m_{2} = \sum\limits_{i} {d_{i}^{2}}& \le& \sum\limits_{i} \left( {\delta_{i}^{2}} + 2\delta_{i}\max(\epsilon \delta_{i}, O(\log(n)))+ \max(\epsilon^{2} {\delta_{i}^{2}}, O(\log^{2} n)) \right)\\ & \le& \mu_{2} + 2\sum\limits_{i} \delta_{i}(\epsilon \delta_{i} + O(\log(n))) + \sum\limits_{i} (\epsilon^{2} {\delta_{i}^{2}} + O(\log^{2} n))\\ & \le& \mu_{2} + 3\epsilon\mu_{2} + (n + \sum\limits_{i} \delta_{i})O(\log^{2} n) \\ \left| \frac{m_{2}}{\mu_{2}} - 1 \right| & \le& 3\epsilon + (n + \sum\limits_{i} \delta_{i})O(\log^{2} n) / \mu_{2}. \end{array} $$

Note that \(\frac {n}{\mu _{2}} = \frac {1}{{\sum }_{i} {\delta _{i}^{2}} / n} \rightarrow 0\) under the given assumption. Furthermore,

$$ \begin{array}{@{}rcl@{}} \frac{({\sum}_{i} \delta_{i})O(\log^{2} n)}{{\sum}_{i} {\delta_{i}^{2}}} = o(1) \rightarrow 0. \end{array} $$

Putting these together, and using \(\epsilon ^{\prime } = 3\epsilon \) we have the given claim. □
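The concentration of m1/μ1 and m2/μ2 around 1 can also be checked empirically. The sketch below samples degrees from a Chung-Lu graph under the assumption that each pair i < j is joined independently with probability δiδj/Σδ (capped at 1, no self-loops); the function names and the expected-degree sequence are hypothetical.

```python
import random

def chung_lu_degrees(deltas, rng):
    """Sample observed degrees from a Chung-Lu graph with expected degrees `deltas`."""
    n = len(deltas)
    total = sum(deltas)
    degrees = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed edge probability: delta_i * delta_j / sum(delta), capped at 1.
            if rng.random() < min(1.0, deltas[i] * deltas[j] / total):
                degrees[i] += 1
                degrees[j] += 1
    return degrees

def moment_ratios(deltas, degrees):
    """Return (m1/mu1, m2/mu2) for observed versus expected degree moments."""
    m1, m2 = sum(degrees), sum(d * d for d in degrees)
    mu1, mu2 = sum(deltas), sum(d * d for d in deltas)
    return m1 / mu1, m2 / mu2

# Hypothetical expected degrees: 200 nodes with 20 and 200 nodes with 60.
deltas = [20.0] * 200 + [60.0] * 200
degrees = chung_lu_degrees(deltas, random.Random(0))
print(moment_ratios(deltas, degrees))
```

Both ratios should come out close to 1 for this sequence; excluding self-loops makes the expected value of m1 slightly smaller than μ1, so a small systematic offset is to be expected in this sketch.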

A.2 Proof of Theorem 5

Proof 2 (Proof of Lemma 3).

It is easy to see that

$$ E_{x\sim \pi} [d_{x}] = \sum\limits_{v = 1}^{n} d_{v} \times \pi_{v} =\frac{{\sum}_{v} {d_{v}^{2}}}{{\sum}_{v} d_{v}}. $$

We show the second claim as follows:

$$ |E_{x\sim \pi} [d_{x}] - E_{x\sim \hat{\pi}} [d_{x}]| \le \sum\limits_{v = 1}^{n} d_{v} |\pi_{v} - \hat{\pi_{v}}| \le n \|\pi - \hat{\pi}\|_{1} = o(1). $$
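Both claims can be verified exactly on a toy degree sequence. The sketch assumes the stationary distribution of the simple random walk, π_v = d_v / Σ d_v, and uses exact rational arithmetic; the degree sequence and the perturbed distribution are hypothetical.

```python
from fractions import Fraction

def stationary_distribution(degrees):
    """pi_v = d_v / sum(d), the stationary law of the simple random walk."""
    total = sum(degrees)
    return [Fraction(d, total) for d in degrees]

def expected_degree(degrees, dist):
    """Expectation of the degree of a node drawn from `dist`."""
    return sum(d * p for d, p in zip(degrees, dist))

degrees = [1, 1, 1, 3]                      # star with three leaves (toy example)
pi = stationary_distribution(degrees)
lhs = expected_degree(degrees, pi)
rhs = Fraction(sum(d * d for d in degrees), sum(degrees))
print(lhs, rhs)                             # first claim: the two values agree

pi_hat = [Fraction(1, 4)] * 4               # a perturbed distribution (here uniform)
gap = abs(lhs - expected_degree(degrees, pi_hat))
l1 = sum(abs(p - q) for p, q in zip(pi, pi_hat))
print(gap, len(degrees) * l1)               # second claim: gap <= n * ||pi - pi_hat||_1
```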

Proof 3 (Proof of Theorem 5).

In our setting the set V is the set of vertices. Define the function f(Xi) as:

$$ d_{\max} \times f(X_{i}) = d_{X_{i}} - E_{\pi}[d_{X_{i}}].$$

f(⋅) clearly satisfies \(E_{\pi }[f] = 0\) and \(\|f\|_{\infty } \le 1\). We can bound \(E_{\pi }[f^{2}]\) as

$$ \begin{array}{@{}rcl@{}} E_{\pi}[f^{2}] \le d_{\max}^{-2}E_{\pi}[{d_{v}^{2}}] = d_{\max}^{-2} \sum\limits_{v} \frac{{d_{v}^{2}} \times d_{v}}{{\sum}_{v} d_{v}} =d_{\max}^{-2} \sum\limits_{v} \frac{{d_{v}^{3}}}{{\sum}_{v} d_{v}}. \end{array} $$

Using the first t steps, we reach the distribution \(\hat {\pi }\) that satisfies \(\|\pi - \hat {\pi }\|_{1} = o(n^{-1})\). Hence,

$$ \begin{array}{@{}rcl@{}} \|\hat{\pi} / \pi\|_{2}^{2} & =& \sum\limits_{v} \pi_{v}(\hat{\pi}_{v}/\pi_{v})^{2} = \sum\limits_{v} \hat{\pi}_{v}^{2} / \pi_{v} = \sum\limits_{v}(\pi_{v} + (\hat{\pi}_{v} - \pi_{v}))^{2} / \pi_{v}\\ & =& \sum\limits_{v} (\pi_{v} + 2 (\hat{\pi}_{v} - \pi_{v}) + (\hat{\pi}_{v} - \pi_{v})^{2}/\pi_{v})\\ & =& 1 + 2\times (1 - 1) + \sum\limits_{v} (\hat{\pi}_{v} - \pi_{v})^{2}/\pi_{v} \le 1 + \|\pi - \hat{\pi}\|_{2}^{2} / \min(\pi_{v})\\ & \le& 1 + \|\pi - \hat{\pi}\|_{1}^{2} \left( \sum\limits_{v} d_{v}\right)/d_{\min} = 1 + o(1), \end{array} $$

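where the last step follows because \(\|\pi - \hat {\pi }\|_{1} = o(n^{-1})\) implies \(\|\pi - \hat {\pi }\|_{1}^{2} = o(n^{-2})\), while \({\sum }_{v} d_{v} \le n^{2}\) and \(d_{\min } \ge 1\).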

We use \(b^{2} = d_{\max }^{-2} {\sum }_{v} \frac {{d_{v}^{3}}}{{\sum }_{v} d_{v}}\) and \(\gamma = \epsilon d_{\max }^{-1}\times \frac {{\sum }_{v} {d_{v}^{2}}}{{\sum }_{v} d_{v}}\). Hence

$$ \begin{array}{@{}rcl@{}} \gamma / b^{2} = \epsilon d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}} \ \text{and}\ \gamma^{2} / b^{2} = \epsilon^{2} \frac{({\sum}_{v} {d_{v}^{2}})^{2}}{({\sum}_{v} d_{v}) ({\sum}_{v} {d_{v}^{3}})}. \end{array} $$

Hence,

$$ \begin{array}{@{}rcl@{}} h(5\gamma / b^{2})& =& \left( 1 + 5 \epsilon d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}}\right)^{1/2} - 1 + 5\epsilon d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{2{\sum}_{v} {d_{v}^{3}}} \\&\le& \left( 5 \epsilon d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}}\right)^{1/2} + 2.5 \epsilon d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}}\\ & \le& 6\epsilon^{1/2} d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}}. \end{array} $$

Plugging this in, we get

$$ \begin{array}{@{}rcl@{}} \frac{r\gamma^{2} \varepsilon(Q)}{4b^{2}(1 + h(5\gamma/b^{2}))} \!\!\!&\ge&\!\!\! r \varepsilon(Q)\!\times\! \epsilon^{2} \frac{({\sum}_{v} {d_{v}^{2}})^{2}}{({\sum}_{v} d_{v}) ({\sum}_{v} {d_{v}^{3}})} \!\times\! \left( 1 + 6\epsilon^{1/2} d_{\max} \frac{{\sum}_{v} {d_{v}^{2}} }{{\sum}_{v} {d_{v}^{3}}}\right)^{-1} \\\!\!\!&\ge&\!\!\! \frac{r \varepsilon(Q)\epsilon^{3/2}({\sum}_{v} {d_{v}^{2}})}{6({\sum}_{v} d_{v}) d_{\max}}. \end{array} $$

Setting \(r = \frac {1}{\varepsilon (Q) \epsilon ^{3/2}} \times \frac {6({\sum }_{v} d_{v}) d_{\max }}{({\sum }_{v} {d_{v}^{2}})} \log (1/\delta )\), and using Theorem 4, we can claim that, with probability 1 − δ,

$$ T_{2}(A) \in \left( (1 - \epsilon)\frac{{\sum}_{v} {d_{v}^{2}}}{{\sum}_{v} d_{v}}, (1 + \epsilon)\frac{{\sum}_{v} {d_{v}^{2}}}{{\sum}_{v} d_{v}} \right). $$

The bound on the number of nodes touched/queried by the algorithm follows naturally. □
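For a sense of scale, the choice of r in the proof can be evaluated directly from a degree sequence. In this sketch the argument `gap` stands in for the quantity ε(Q) from Theorem 4, which is treated here as a given positive number; the function name and the example values are hypothetical.

```python
import math

def required_walk_samples(degrees, eps, delta, gap):
    """Evaluate r = 6 * sum(d) * d_max * log(1/delta) / (gap * eps^1.5 * sum(d^2))."""
    s1 = sum(degrees)
    s2 = sum(d * d for d in degrees)
    d_max = max(degrees)
    r = 6.0 * s1 * d_max * math.log(1.0 / delta) / (gap * eps ** 1.5 * s2)
    return math.ceil(r)

# Hypothetical example: a 4-regular graph on 10 nodes, with an assumed gap of 0.5.
print(required_walk_samples([4] * 10, eps=0.25, delta=0.05, gap=0.5))
```

In this formula r grows as ε shrinks, as δ shrinks, or as the gap shrinks, and it decreases when the second degree moment is large relative to d_max times the first.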

Cite this article

Dasgupta, A., Sengupta, S. Scalable Estimation of Epidemic Thresholds via Node Sampling. Sankhya A 84, 321–344 (2022). https://doi.org/10.1007/s13171-021-00249-0

  • DOI: https://doi.org/10.1007/s13171-021-00249-0

Keywords

  • Epidemic threshold
  • Networks
  • Sampling
  • Random walk
  • Configuration model
  • Epidemiology.

PACS Nos

  • 62F10 (primary)
  • 68W20, 68W25