## Abstract

Given a contact network and coarse-grained diagnostic information such as electronic Healthcare Reimbursement Claims (eHRC) data, can we develop efficient intervention policies from data to control an epidemic? Immunization is an important problem in multiple areas, especially epidemiology and public health. However, most existing studies assume a prior epidemiological model to develop pre-emptive strategies, which may fail to adapt to new epidemiological patterns and to the availability of rich data such as eHRC. In practice, disease spread is usually complicated, so an assumed underlying model may deviate from the true spreading patterns and lead to inaccurate interventions. Additionally, the abundance of health care surveillance data (such as eHRC) makes it possible to study data-driven strategies without overly restrictive assumptions. Hence, such a data-driven intervention approach can help public-health experts make more practical decisions. In this paper, we take into account propagation logs and contact networks for controlling propagation. Unlike previous model-based approaches, our solutions are purely data driven in the sense that we develop immunization strategies directly from the network and eHRC data without assuming classical epidemiological models. In particular, we formulate the novel and challenging *data-driven immunization* problem. To solve it, we first propose an efficient sampling approach to align surveillance data with contact networks, and then develop an efficient algorithm with a provable approximation guarantee for immunization. Finally, we show the effectiveness and scalability of our methods via extensive experiments on multiple datasets, and conduct case studies on nation-wide real medical surveillance data.

## Notes

- 1.
Code in Python: https://goo.gl/tsMueB.

- 2.
- 3.
We extract vaccine reports based on ICD-9 codes V04.81. These are actual vaccine allocations as given in the eHRC data.

## References

- 1.
Medlock J, Galvani AP (2009) Optimizing influenza vaccine distribution. Science 325:1705–1708

- 2.
Halloran ME, Ferguson NM, Eubank S, Longini IM, Cummings DAT, Lewis B, Xu S, Fraser C, Vullikanti A, Germann TC, Wagener D, Beckman R, Kadau K, Barrett C, Macken CA, Burke DS, Cooley P (2008) Modeling targeted layered containment of an influenza pandemic in the United States. In: Proceedings of the National Academy of Sciences (PNAS), March 10 2008, pp 4639–4644

- 3.
Tong H, Prakash BA, Tsourakakis CE, Eliassi-Rad T, Faloutsos C, Chau DH (2010) On the vulnerability of large graphs. In: ICDM

- 4.
Zhang Y, Adiga A, Vullikanti A, Prakash BA (2015) Controlling propagation at group scale on networks. In: 2015 IEEE international conference on data mining (ICDM). IEEE, pp 619–628

- 5.
Zhang Y, Prakash BA (2014) Dava: distributing vaccines over networks under prior information. In: Proceedings of the SIAM data mining conference, ser. SDM’14

- 6.
Pellis L, Ball F, Bansal S, Eames K, House T, Isham V, Trapman P (2015) Eight challenges for network epidemic models. Epidemics 10:58–62

- 7.
Ramanathan A, Pullum LL, Hobson TC, Steed CA, Quinn SP, Chennubhotla CS, Valkova S (2015) Orbit: Oak Ridge biosurveillance toolkit for public health dynamics. BMC Bioinform 16(17):S4

- 8.
Ozmen O, Pullum LL, Ramanathan A, Nutaro JJ (2016) Augmenting epidemiological models with point-of-care diagnostics data. PLoS ONE 11(4):1–13

- 9.
Barrett CL, Beckman RJ, Khan M, Anil Kumar VS, Marathe MV, Stretz PE, Dutta T, Lewis B (2009) Generation and analysis of large synthetic social contact networks. In: Winter simulation conference, pp 1003–1014

- 10.
Eubank S, Guclu H, Anil Kumar VS, Marathe MV, Srinivasan A, Toroczkai Z, Wang N (2004) Modelling disease outbreaks in realistic urban social networks. Nature 429(6988):180–184

- 11.
Prakash BA, Chakrabarti D, Faloutsos M, Valler N, Faloutsos C (2012) Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl Inf Syst 33:549–575

- 12.
Tong H, Prakash BA, Eliassi-Rad T, Faloutsos M, Faloutsos C (2012) Gelling, and melting, large graphs by edge manipulation. In: Proceedings of CIKM

- 13.
Anderson RM, May RM (1991) Infectious diseases of humans. Oxford University Press, Oxford

- 14.
Karp RM (1972) Reducibility among combinatorial problems. In: Complexity of computer computations. Springer, pp 85–103

- 15.
Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions—I. Math Program 14(1):265–294

- 16.
Palmer CR, Gibbons PB, Faloutsos C (2002) Anf: a fast and scalable tool for data mining in massive graphs. Ser. KDD ’02. ACM, New York, NY, USA, pp 81–90

- 17.
Flajolet P, Martin GN (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31(2):182–209

- 18.
McDaid AF, Murphy B, Friel N, Hurley N (2012) Clustering in networks with the collapsed stochastic block model. arXiv preprint arXiv:1203.3083

- 19.
Kumar R, Novak J, Raghavan P, Tomkins A (2003) On the bursty evolution of blogspace. In: WWW’03, pp 568–576

- 20.
Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: KDD’03

- 21.
Goyal A, Bonchi F, Lakshmanan LV (2011) A data-based approach to social influence maximization. Proc VLDB Endow 5(1):73–84

- 22.
Hethcote HW (2000) The mathematics of infectious diseases. SIAM Rev 42:599–653

- 23.
Ganesh A, Massoulie L, Towsley D (2005) The effect of network topology on the spread of epidemics. In: Proceedings of INFOCOM

- 24.
Cohen R, Havlin S, Ben Avraham D (2003) Efficient immunization strategies for computer networks and populations. Phys Rev Lett 91(24):247901

- 25.
Aspnes J, Chang K, Yampolskiy A (2005) Inoculation strategies for victims of viruses and the sum-of-squares partition problem. In: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, series SODA’05, pp 43–52

- 26.
Van Mieghem P, Stevanović D, Kuipers F, Li C, Van De Bovenkamp R, Liu D, Wang H (2011) Decreasing the spectral radius of a graph by link removals. Phys Rev E 84(1):016101

- 27.
Prakash BA, Adamic LA, Iwashyna TJ, Tong H, Faloutsos C (2013) Fractional immunization in networks. In: Proceedings of SDM, pp 659–667

- 28.
Shim E (2013) Optimal strategies of social distancing and vaccination against seasonal influenza. Math Biosci Eng 10(5):1615–1634

- 29.
Khalil EB, Dilkina B, Song L (2014) Scalable diffusion-aware optimization of network topology. In: KDD 2014. ACM, pp 1226–1235

- 30.
Saha B, Gupta S, Phung D, Venkatesh S (2017) Effective sparse imputation of patient conditions in electronic medical records for emergency risk predictions. Knowl Inf Syst 53(1):179–206. https://doi.org/10.1007/s10115-017-1038-0

- 31.
Patwardhan A, Bilkovski R (2012) Comparison: flu prescription sales data from a retail pharmacy in the US with Google Flu Trends and US ILINet (CDC) data as flu activity indicator. PLoS ONE 7(8):e43611

- 32.
Gog JR, Ballesteros S, Viboud C, Simonsen L, Bjornstad ON, Shaman J, Chao DL, Khan F, Grenfell BT (2014) Spatial transmission of 2009 pandemic influenza in the US. PLoS Comput Biol 10(6):e1003635

- 33.
Malhotra K, Hobson TC, Valkova S, Pullum LL, Ramanathan A (2015) Sequential pattern mining of electronic healthcare reimbursement claims: experiences and challenges in uncovering how patients are treated by physicians. In: 2015 IEEE international conference on big data (big data). IEEE, pp 2670–2679

## Acknowledgements

This paper is based on work partially supported by the NSF (IIS-1353346, CAREER IIS-1750407), the NEH (HG-229283-15), ORNL, the Maryland Procurement Office (H98230-14-C-0127), and a Facebook faculty gift to BAP. AV is partially supported by the following grants: DTRA CNIMS Contract HDTRA1- 11-D-0016-0010, NSF BIG DATA Grant IIS-1633028 and NSF DIBBS Grant ACI-1443054, NSF EAGER SSDIM-1745207. Publication of this article was also funded by ORNL LDRD funding to AR. Oak Ridge National Laboratory (ORNL) (Grant No. Order 4000143330) is operated by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. The US Government retains and the publisher, by accepting the article for publication, acknowledges that the US Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US Government purposes.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendix

### Proof of Lemma 4.4

First, when \(\alpha _{\mathbf {M}, \ell }\) is optimal, \(\alpha _{\mathbf {M}, \ell }=\alpha ^*_{\mathbf {M}, \ell }\).

Second, let \(\beta _{S_{\ell }}\) be the number of nodes without any parents. Maximizing \(\alpha _{\mathbf {M}, \ell }\) for Problem 3.1 is equivalent to minimizing \(\beta _{S_{\ell }}\) at location \(L_{\ell }\). Let \(\beta ^*_{S_{\ell }}\) be the minimum number of nodes without any parents in the sample at location \(L_{\ell }\). Clearly, \(\beta ^*_{S_{\ell }} = CN(L_{\ell }, t_0)= |S_L|\). For each timestep \(t_i\), if \(CF_i (S_{\ell }) <CN(L_{\ell }, t_i)\), then \(CN(L_{\ell }, t_i)- CF_i(S_{\ell })\) is the number of nodes that cannot be mapped to the cascade generated by \(S_{\ell }\) at timestep \(t_i\). Summing over timesteps, \(\theta (S_{\ell })\) is the total number of nodes that cannot be mapped to the cascade generated by \(S_{\ell }\). If there exists any \(t_i\) such that \(CF_i (S_{\ell }) <CN(L_{\ell }, t_i)\), we can always generate a cascade by mapping all \(CF_i (S_{\ell }) \) nodes into the cascade, and then mapping the other \(\theta (S_{\ell })\) nodes into the cascade uniformly at random. This way, the number of nodes without any parents satisfies \(\beta _{S_{\ell }} \le \beta ^*_{S_{\ell }}+ \theta (S_{\ell }) \), because the \(\theta (S_{\ell })\) nodes can have connections among themselves. Since \(\beta _{S_{\ell }} + \alpha _{S_{\ell }} = \sum _{t_i} N(L_{\ell }, t_i)\), we get \( \alpha _{\mathbf {M}, \ell } \ge \alpha ^*_{\mathbf {M}, \ell } - \theta (S_{\ell })\). Hence, \(\alpha ^*_{\mathbf {M}, \ell } - \theta (S_{\ell }) \le \alpha _{\mathbf {M}, \ell } \le \alpha ^*_{\mathbf {M}, \ell }\). When \(\alpha _{\mathbf {M}, \ell }=\alpha ^*_{\mathbf {M}, \ell }\), \(\theta (S_{\ell }) =0\). \(\square \)
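The shortfall \(\theta (S_{\ell })\) used in this proof can be illustrated with a minimal sketch. It assumes that \(\theta \) sums, over timesteps, the gap \(CN(L_{\ell }, t_i) - CF_i(S_{\ell })\) whenever that gap is positive, as the proof describes. The function name `theta` and the toy counts are hypothetical and not taken from the paper's code.

```python
def theta(cf, cn):
    """Sum of (CN_i - CF_i) over the timesteps where CF_i < CN_i.

    cf: cumulative counts reachable from the sample, one per timestep (assumed input)
    cn: cumulative reported counts at the location, one per timestep (assumed input)
    """
    if len(cf) != len(cn):
        raise ValueError("cf and cn must have one entry per timestep")
    return sum(n - f for f, n in zip(cf, cn) if f < n)


# Hypothetical toy data: only the second timestep falls short, by one node.
cn_counts = [2, 5, 9]
cf_counts = [2, 4, 9]
shortfall = theta(cf_counts, cn_counts)
```

Under this reading, a shortfall of zero means every reported node can be mapped into the cascade generated by the sample.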

### Proof of Lemma 4.5

First, it is clear that \(g(\emptyset )=0\).

Second, to prove that *g*(*S*) is monotonically increasing, we need to prove that \(\theta (S)\) is a monotonically decreasing function. To do that, we first show that \(CF_{i}(S_{\ell })\) is a monotone non-decreasing and submodular function for any *i* and \(L_{\ell }\). Define \(f_{i}(S_{\ell })\) as the number of nodes in \(L_{\ell }\) that \(S_{\ell }\) can reach in *i* hops; hence, \(f_{i}(S_{\ell }) \le f_{i}(S_k)\) when \(S_{\ell }\subseteq S_k\). Next, given \(S_{\ell }\subseteq S_k\) and a node *u*, \(f_{i}(S_{\ell }\cup \{u\}) - f_{i}(S_{\ell })\) is the marginal gain of a set union. Since the function in the set union problem is submodular [14], \(f_{i}(S_{\ell })\) is also submodular. Since \(f_{i}(S_{\ell })\) is monotone non-decreasing and submodular, the cumulative function \(CF_{i}(S_{\ell })\) is also non-decreasing and submodular.
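The set-union argument can be checked by brute force on a small graph. The sketch below assumes that \(f_i(S)\) counts the nodes within *i* hops of *S* (seeds included) in a directed graph stored as an adjacency dictionary. The graph, `reach_within`, `f`, and `is_monotone_submodular` are illustrative names rather than the authors' implementation.

```python
from collections import deque
from itertools import combinations


def reach_within(adj, seeds, hops):
    """Set of nodes reachable from `seeds` in at most `hops` steps (multi-source BFS)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == hops:
            continue  # do not expand beyond the hop limit
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def f(adj, seeds, hops):
    """Number of nodes within `hops` steps of `seeds`."""
    return len(reach_within(adj, seeds, hops))


def subsets(nodes):
    """Yield every subset of `nodes` as a frozenset."""
    nodes = list(nodes)
    for k in range(len(nodes) + 1):
        for combo in combinations(nodes, k):
            yield frozenset(combo)


def is_monotone_submodular(adj, hops):
    """Exhaustively check monotonicity and diminishing returns of f(., hops)."""
    nodes = list(adj)
    all_sets = list(subsets(nodes))
    for s in all_sets:
        for t in all_sets:
            if not s <= t:
                continue
            if f(adj, s, hops) > f(adj, t, hops):
                return False
            for u in nodes:
                gain_s = f(adj, s | {u}, hops) - f(adj, s, hops)
                gain_t = f(adj, t | {u}, hops) - f(adj, t, hops)
                if gain_s < gain_t:
                    return False
    return True


# Hypothetical toy contact graph (directed edges).
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
holds_for_two_hops = is_monotone_submodular(toy_graph, 2)
```

The check passes because the reach of a set is the union of the reaches of its members, which makes \(f_i\) a coverage function.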

Let \(X_i=[{\mathbb {1}}_{CF_i (A \cup B) <CN_i } (CN_i - CF_i(A \cup B))]\), \(Y_i=[{\mathbb {1}}_{CF_i (A) <CN_i } (CN_i - CF_i(A))] \). For any sets *A* and *B*,

\[ \theta (A \cup B) - \theta (A) = \sum _{i=1}^t (X_i - Y_i). \]

For any *i*, let us consider the following two cases:

(1) If \({\mathbb {1}}_{CF_i (A) <CN_i }=0\), it means \(CF_i (A) \ge CN_i\), then \(CF_i (A\cup B) \ge CN_i\); hence, \({\mathbb {1}}_{CF_i (A \cup B) <CN_i }=0\). We have \(X_i - Y_i=0\).

(2) If \({\mathbb {1}}_{CF_i (A) <CN_i }=1\), we have two cases:

(2a) \({\mathbb {1}}_{CF_i (A \cup B) <CN_i }=0\), then \(X_i -Y_i = -Y_i = - (CN_i - CF_i(A)) <0\);

(2b) \({\mathbb {1}}_{CF_i (A \cup B) <CN_i }=1\), then \( X_i - Y_i = (CN_i - CF_i(A \cup B))- (CN_i - CF_i(A)) = CF_i(A) - CF_i(A\cup B) \le 0\) (using Claim 2).

Putting the cases together, we have \(X_i - Y_i \le 0\) for every *i*, so \(\theta (A \cup B) \le \theta (A)\). Hence, \(\theta (S)\) is monotonically decreasing, and *g*(*S*) is monotonically increasing.

Third, to prove that *g*(*S*) is submodular, we need to prove that, for any location \(L_{\ell }\) and given \(S \subseteq T\), \(g(S \cup \{a\}) - g(S) \ge g(T \cup \{a\}) - g(T)\), which is equivalent to \(\theta (S\cup \{a\})-\theta (S) \le \theta (T\cup \{a\})-\theta (T) \) (supermodularity of \(\theta \)). Let us write

\(\delta (S,a,i)= [{\mathbb {1}}_{CF_i (S \cup \{a\})<CN_i } (CN_i - CF_i(S \cup \{a\}))] -[{\mathbb {1}}_{CF_i (S) <CN_i } (CN_i - CF_i(S))]\), and

\(\delta (T,a,i) = [{\mathbb {1}}_{CF_i (T \cup \{a\})<CN_i } (CN_i - CF_i(T \cup \{a\}))] -[{\mathbb {1}}_{CF_i (T) <CN_i } (CN_i - CF_i(T))]\), then,

\(\theta (S \cup \{a\}) - \theta (S) = \sum _{i=1}^t \delta (S,a,i) \), and \(\theta (T \cup \{a\}) - \theta (T) = \sum _{i=1}^t \delta (T,a,i) \). It therefore suffices to show that \(\delta (S,a,i) \le \delta (T,a,i)\) for every *i*.

For any *i*, let us consider the following two cases:

(1) If \({\mathbb {1}}_{CF_i (S) <CN_i }=0\), then \({\mathbb {1}}_{CF_i (S \cup \{a\})<CN_i }={\mathbb {1}}_{CF_i (T)<CN_i }={\mathbb {1}}_{CF_i (T\cup \{a\}) <CN_i }=0\). Hence, \(\delta (S,a,i)=\delta (T,a,i)=0\).

(2) If \({\mathbb {1}}_{CF_i (S) <CN_i }=1\), we have the following cases:

(2a) If \({\mathbb {1}}_{CF_i (T) <CN_i }=0\), then we have \({\mathbb {1}}_{CF_i (T \cup \{a\}) <CN_i }=0\), so \(\delta (T,a,i)=0\). Let us consider the value of \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }\):

If \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }=0\), then \(\delta (S,a,i) = -(CN_i - CF_i(S)) < 0 = \delta (T,a,i) \).

If \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }=1\), then \(\delta (S,a,i) = CF_i (S) - CF_i(S \cup \{a\}) \le 0 = \delta (T,a,i)\).

(2b) If \({\mathbb {1}}_{CF_i (T) <CN_i }=1\), let us consider the value of \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }\):

If \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }=0\), then \({\mathbb {1}}_{CF_i (T \cup \{a\}) <CN_i }=0\), and then \(\delta (S,a,i) = -(CN_i - CF_i(S)) \le -(CN_i - CF_i(T)) = \delta (T,a,i) \) (using Claim 2).

If \({\mathbb {1}}_{CF_i (S \cup \{a\}) <CN_i }=1\), then for \({\mathbb {1}}_{CF_i (T \cup \{a\}) <CN_i }\):

If \({\mathbb {1}}_{CF_i (T \cup \{a\}) <CN_i }=1\), then \(\delta (S,a,i)= CF_i(S) - CF_i(S \cup \{a\}) \le CF_i(T) - CF_i(T \cup \{a\}) =\delta (T, a,i)\) (using Claim 2 that \(CF_i(S)\) is a submodular function).

Otherwise, \({\mathbb {1}}_{CF_i (T \cup \{a\}) <CN_i }=0\), and then since we have \(CF_i (T \cup \{a\}) \ge CN_i\), \(\delta (S,a,i)= CF_i(S) - CF_i(S \cup \{a\}) \le CF_i(T) - CF_i(T \cup \{a\}) \le CF_i(T) - CN_i = \delta (T, a,i)\) (using Claim 2).

Putting all cases together, we have \(\theta (S \cup \{a\}) - \theta (S) \le \theta (T \cup \{a\}) - \theta (T)\). Hence, \(g(S \cup \{a\}) - g(S) \ge g(T \cup \{a\}) - g(T)\).

*g*(*S*) is a submodular function. \(\square \)
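The lemma can be sanity-checked exhaustively on a toy instance. The sketch below rests on several assumptions that are not stated in this appendix: \(CF_i(S)\) is taken to be the number of nodes within *i* hops of *S* (seeds included), timesteps are indexed from 0, and \(g(S) = \theta (\emptyset ) - \theta (S)\), so that \(g(\emptyset )=0\). All names and data are hypothetical.

```python
from itertools import combinations


def reach_count(adj, seeds, hops):
    """Number of nodes within `hops` steps of `seeds` (seeds included)."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return len(seen)


def theta(adj, seeds, cn):
    """Shortfall summed over timesteps, with CF_i assumed to be the i-hop reach."""
    total = 0
    for i, target in enumerate(cn):
        cf = reach_count(adj, seeds, i)
        if cf < target:
            total += target - cf
    return total


def g(adj, seeds, cn):
    """Assumed form g(S) = theta(empty set) - theta(S), so g(empty set) = 0."""
    return theta(adj, frozenset(), cn) - theta(adj, seeds, cn)


def subsets(nodes):
    """Yield every subset of `nodes` as a frozenset."""
    nodes = list(nodes)
    for k in range(len(nodes) + 1):
        for combo in combinations(nodes, k):
            yield frozenset(combo)


def g_is_monotone_submodular(adj, cn):
    """Exhaustively check g(S) <= g(T) and diminishing returns for all S within T."""
    nodes = list(adj)
    all_sets = list(subsets(nodes))
    for s in all_sets:
        for t in all_sets:
            if not s <= t:
                continue
            if g(adj, s, cn) > g(adj, t, cn):
                return False
            for a in nodes:
                gain_s = g(adj, s | {a}, cn) - g(adj, s, cn)
                gain_t = g(adj, t | {a}, cn) - g(adj, t, cn)
                if gain_s < gain_t:
                    return False
    return True


# Hypothetical toy graph and cumulative case counts for timesteps 0, 1, 2.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
toy_counts = [1, 3, 4]
lemma_holds = g_is_monotone_submodular(toy_graph, toy_counts)
```

Under these assumptions \(g(S)\) equals \(\sum _i \min (CF_i(S), CN_i)\), a sum of truncated monotone submodular functions, which is why the exhaustive check succeeds.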

### Proof of Lemma 4.9

Since we uniformly randomly allocate \(\mathbf {x}\), \(\rho _{G, \mathbf {M}_i}(\mathbf {x})\) can be written as \(\rho _{G, \mathbf {M}_i}(\mathbf {x}) = \sum _S \Pr (S) r_{G, \mathbf {M}_i}(S)\), where *S* is a node set sampled from the random process of distributing \(\mathbf {x}\) (\(|S| = ||\mathbf {x}||_1\)), and \(r_{G, \mathbf {M}_i}(S)\) is the number of nodes \(SI_{\mathbf {M}_i}\) can reach after removing *S*.

Since \(\zeta _{G, \mathbf {M}_i}(\mathbf {x}) = \sum _{S} \Pr (S) C_{G, \mathbf {M}_i}(S)\) and \(\rho _{G, \mathbf {M}_i}(\mathbf {x}) = \sum _S \Pr (S) r_{G, \mathbf {M}_i}(S)\), we need to show that \( r_{G, \mathbf {M}_i}(S) \le C_{G, \mathbf {M}_i}(S)\). Recall that \( r_{G, \mathbf {M}_i}(S)\) is the number of nodes *S* can save in \(\mathbf {M}_i\). We can show that, for any node *u* that is saved from \(SI_\mathbf {M}\), the credit *u* gives to \(SI_\mathbf {M}\) must be 1. This is because if we can save *u*, every path from \(SI_\mathbf {M}\) to *u* has been removed when *S* is removed. Hence, all nodes within the paths from \(SI_\mathbf {M}\) have been removed. These nodes are exactly the nodes that propagate *u*’s credit to \(SI_\mathbf {M}\), so all of *u*’s credit is contributed to \(C_{G, \mathbf {M}_i}(S)\). Hence, \(C_{G, \mathbf {M}_i}(S)\) is at least \(r_{G, \mathbf {M}_i}(S)\). In addition, nodes that *S* cannot save also contribute credit to \(C_{G, \mathbf {M}_i}(S)\). Hence, \(C_{G, \mathbf {M}_i}(S) \ge r_{G, \mathbf {M}_i}(S)\), which leads to \(\rho _{G, \mathbf {M}_i}(\mathbf {x}) \le \zeta _{G, \mathbf {M}_i}(\mathbf {x}) \). \(\square \)

### Proof of Lemma 4.11

We use a technique similar to that in [4], based on properties \(P_1\), \(P_2\) and \(P_3\) of \(\zeta _{G, \mathbf {M}_i} (\mathbf {x})\). For brevity, we write \(\zeta _{G, \mathbf {M}_i} (\mathbf {x})\) as \(\zeta (\mathbf {x})\).

First, we show that if \(\mathbf {y}=(y_1, \ldots ,y_n)^T\) where \(\sum _j y_j=m\), then \(\zeta (\mathbf {x}+ \mathbf {y}) - \zeta (\mathbf {x}) \le \sum _j y_j (\zeta (\mathbf {x}+\mathbf {e}_j)-\zeta (\mathbf {x}))\).

Note that \(\mathbf {y}\) can be recursively obtained from a sequence \(\mathbf {e}^{(1)}, \ldots , \mathbf {e}^{(m)}\) (\(\mathbf {e}^{(i)} \in \{\mathbf {e}_1,\ldots ,\mathbf {e}_n\}\)) such that \(\mathbf {y}=\mathbf {y}^{(m)}=\mathbf {y}^{(m-1)}+\mathbf {e}^{(m)}\), \(\mathbf {y}^{(i)}=\mathbf {y}^{(i-1)}+\mathbf {e}^{(i)}\) (\(i \le m\)) and \(\mathbf {y}^{(0)}=\mathbf {0}\).

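Obviously, \(\sum _{i=1}^m \mathbf {e}^{(i)}= \sum _j y_j \mathbf {e}_j =\mathbf {y}\). Then, by telescoping over the sequence and using the diminishing-returns property of \(\zeta \),

\[ \zeta (\mathbf {x}+\mathbf {y}) - \zeta (\mathbf {x}) = \sum _{i=1}^m \left[ \zeta (\mathbf {x}+\mathbf {y}^{(i)}) - \zeta (\mathbf {x}+\mathbf {y}^{(i-1)}) \right] \le \sum _{i=1}^m \left[ \zeta (\mathbf {x}+\mathbf {e}^{(i)}) - \zeta (\mathbf {x}) \right] = \sum _j y_j \left( \zeta (\mathbf {x}+\mathbf {e}_j)-\zeta (\mathbf {x}) \right) . \]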

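Now, let us prove that ImmuNaiveGreedy gives a \((1-1/e)\)-approximate solution. Suppose \(\mathbf {x}\) is the solution from ImmuNaiveGreedy, and \(\mathbf {x}^*\) is the optimal solution. Clearly, we have \(\sum _{j}x_j=\sum _{j}x^*_j=m\). Let us define \(\mathbf {x}^{(i)}\) as the solution obtained after the *i*th iteration of the greedy algorithm; hence, \(\mathbf {x}=\mathbf {x}^{(m)}\). Moreover, \(\mathbf {x}^*\) can be represented as \(\sum _j x^*_j \mathbf {e}_j\). By monotonicity, the inequality above, and the greedy choice of \(\mathbf {x}^{(i+1)}\), we have

\[ \zeta (\mathbf {x}^*) \le \zeta (\mathbf {x}^{(i)}+\mathbf {x}^*) \le \zeta (\mathbf {x}^{(i)}) + \sum _j x^*_j \left( \zeta (\mathbf {x}^{(i)}+\mathbf {e}_j)-\zeta (\mathbf {x}^{(i)}) \right) \le \zeta (\mathbf {x}^{(i)}) + m \left( \zeta (\mathbf {x}^{(i+1)})-\zeta (\mathbf {x}^{(i)}) \right) . \]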

Hence, \(\zeta (\mathbf {x}^{(i+1)}) \ge (1-\frac{1}{m})\zeta (\mathbf {x}^{(i)}) +\frac{1}{m} \zeta (\mathbf {x}^*)\). Recursively, we can get \(\zeta (\mathbf {x}^{(i)}) \ge (1-(1-\frac{1}{m})^i) \zeta (\mathbf {x}^*)\). Therefore, \(\zeta (\mathbf {x})=\zeta (\mathbf {x}^{(m)}) \ge (1-(1-\frac{1}{m})^m) \zeta (\mathbf {x}^*) \ge (1-1/e) \zeta (\mathbf {x}^*)\). \(\square \)
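The bound can be illustrated numerically. The sketch below is not ImmuNaiveGreedy itself: it runs a generic unit-increment greedy on a stand-in objective `zeta` that is monotone with diminishing returns, and compares the result with a brute-force optimum. The objective, the weights, and the function names are hypothetical choices made only for illustration.

```python
import math
from itertools import combinations_with_replacement


def zeta(x, weights, p=0.5):
    """Stand-in objective (an assumption): each extra unit at j helps less than the last."""
    return sum(w * (1.0 - (1.0 - p) ** xj) for w, xj in zip(weights, x))


def naive_greedy(weights, m):
    """Allocate m units one at a time, always to the coordinate with the largest gain."""
    n = len(weights)
    x = [0] * n
    for _ in range(m):
        base = zeta(x, weights)
        best_j, best_gain = 0, -1.0
        for j in range(n):
            x[j] += 1
            gain = zeta(x, weights) - base
            x[j] -= 1
            if gain > best_gain:
                best_j, best_gain = j, gain
        x[best_j] += 1
    return x


def brute_force(weights, m):
    """Best allocation of m units found by enumerating every allocation."""
    n = len(weights)
    best = None
    for combo in combinations_with_replacement(range(n), m):
        x = [0] * n
        for j in combo:
            x[j] += 1
        if best is None or zeta(x, weights) > zeta(best, weights):
            best = x
    return best


# Hypothetical instance: three locations, four vaccine units.
toy_weights = [5.0, 3.0, 1.0]
budget = 4
x_greedy = naive_greedy(toy_weights, budget)
x_opt = brute_force(toy_weights, budget)
ratio = zeta(x_greedy, toy_weights) / zeta(x_opt, toy_weights)
```

The guarantee in the lemma says that this ratio is at least \(1-(1-\frac{1}{m})^m\), which itself never drops below \(1-1/e\).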

## About this article

### Cite this article

Zhang, Y., Ramanathan, A., Vullikanti, A. *et al.* Data-driven efficient network and surveillance-based immunization.
*Knowl Inf Syst* **61**, 1667–1693 (2019). https://doi.org/10.1007/s10115-018-01326-x

### Keywords

- Graph mining
- Social networks
- Immunization
- Diffusion