Knowledge and Information Systems, Volume 38, Issue 1, pp 35–59

Efficiently spotting the starting points of an epidemic in a large graph

  • B. Aditya Prakash
  • Jilles Vreeken
  • Christos Faloutsos
Regular paper

Abstract

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method called NetSleuth for the well-known susceptible-infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments with our method show high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.
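
To make the problem setting concrete, the following is a minimal sketch of the forward process only: a discrete-time susceptible-infected (SI) simulation that produces the kind of infected snapshot the abstract takes as input. It is not NetSleuth and does not recover seeds. The function name `simulate_si`, the per-step infection probability `beta`, the adjacency-dictionary representation, and the toy path graph are all assumptions made for illustration.

```python
import random


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Run a discrete-time SI process and return the set of infected nodes.

    Assumed model: at every step, each infected node independently infects
    each of its susceptible neighbors with probability `beta`. Infected
    nodes never recover.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adjacency[u]:
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        # Apply new infections only after the whole step has been evaluated.
        infected |= newly_infected
    return infected


# Hypothetical toy graph: an undirected path 0 - 1 - 2 - 3 - 4 - 5.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

# With beta = 1.0 the infection advances exactly one hop per step.
snapshot = simulate_si(adjacency, seeds={0}, beta=1.0, steps=3)
```

The seed-identification question runs in the opposite direction: given only `snapshot` and `adjacency`, find the seed set (here `{0}`) that best explains what is observed.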

Keywords

Epidemics, Diffusion, Culprits, Seeds

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • B. Aditya Prakash (1)
  • Jilles Vreeken (2)
  • Christos Faloutsos (3)

  1. Department of Computer Science, Virginia Tech, Blacksburg, USA
  2. Advanced Database Research and Modeling, University of Antwerp, Antwerp, Belgium
  3. Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
