Knowledge and Information Systems, Volume 59, Issue 1, pp 67–92

Fast crawling methods of exploring content distributed over large graphs

  • Pinghui Wang
  • Junzhou Zhao
  • John C. S. Lui
  • Don Towsley
  • Xiaohong Guan
Regular Paper


Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large scale of these networks and the limited query rate imposed by network service providers, exhaustively crawling and enumerating the content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that naively applied sampling can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which estimate content characteristics using information available in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
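The sampling bias described above can be illustrated with a toy sketch. It is not the SCE or WCE estimator as defined in the paper, whose details are not given in this abstract. It assumes, hypothetically, that each copy sits on its own vertex, that vertices are sampled uniformly rather than by a crawl, and that every copy exposes the total replica count k of its content as meta-information. The function names are illustrative only.

```python
import random


def build_copies(replica_counts):
    """One record per copy: (content_id, k), where k is the total replica
    count of that content. Assumes every copy exposes k as meta-information."""
    copies = []
    for content_id, k in enumerate(replica_counts):
        copies.extend((content_id, k) for _ in range(k))
    return copies


def naive_mean(sampled):
    """Plain average of k over sampled copies. Content with many replicas
    is sampled more often, so this over-counts popular content."""
    return sum(k for _, k in sampled) / len(sampled)


def weighted_mean(sampled):
    """Weight each sampled copy by 1/k, so that each content item
    contributes one unit in expectation regardless of its replica count."""
    return len(sampled) / sum(1.0 / k for _, k in sampled)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    counts = [1] * 900 + [50] * 100  # hypothetical workload
    copies = build_copies(counts)
    sample = rng.sample(copies, 500)  # uniform sample of copies (vertices)
    print("true mean    :", sum(counts) / len(counts))
    print("naive        :", naive_mean(sample))
    print("1/k weighted :", weighted_mean(sample))
```

In this toy setting, the naive average over all copies equals the sum of k² divided by the sum of k, which exceeds the true mean whenever replica counts differ, while the 1/k-weighted ratio recovers the true mean when every copy is observed.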


Keywords: Crawling; Online social networks; Sampling; Random walks



The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and Shenzhen Basic Research Grant (JCYJ20160229195940462).



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  • Pinghui Wang (1, 2)
  • Junzhou Zhao (3, corresponding author)
  • John C. S. Lui (4)
  • Don Towsley (5)
  • Xiaohong Guan (1, 6)

  1. MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China
  2. Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China
  3. Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  4. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong
  5. Department of Computer Science, University of Massachusetts Amherst, Amherst, US
  6. Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China
