# Efficiently spotting the starting points of an epidemic in a large graph


## Abstract

Given a snapshot of a large graph in which an infection has been spreading for some time, can we identify the nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method called NetSleuth for the well-known susceptible-infected (SI) virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length (MDL) principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show that we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments with our method show high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.
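To make the problem setting concrete, the following is a minimal sketch of the forward process that produces such a snapshot. It is not NetSleuth itself. It assumes a discrete-time SI model in which every infected node infects each susceptible neighbor independently with the same probability per step. The names `simulate_si`, `adjacency`, and `beta` are hypothetical and chosen only for illustration.

```python
import random


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process and return the set of infected nodes.

    Assumptions (not taken from the paper): time is discrete, and in each
    step every infected node infects each susceptible neighbor independently
    with probability `beta`. Infected nodes never recover.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adjacency[u]:
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        # Nodes infected in this step start spreading in the next step.
        infected |= newly_infected
    return infected


# Toy path graph 0 - 1 - 2 - 3 - 4 with a single seed in the middle.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate_si(adjacency, seeds={2}, beta=1.0, steps=1)
print(sorted(snapshot))
```

The inverse problem addressed in the abstract starts from a snapshot like `snapshot` and asks which seed set, and how many seeds, explain it best.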

## Keywords

Epidemics, Diffusion, Culprits, Seeds

## Notes

### Acknowledgments

This material is based upon work supported by the Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and the National Science Foundation under Grant No. IIS-1017415. Jilles Vreeken is supported by a Postdoctoral Fellowship of the Research Foundation – Flanders (FWO).
