
Fast crawling methods of exploring content distributed over large graphs

Abstract

Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large scale of these networks and the limited query rate imposed by network service providers, exhaustively crawling and enumerating the content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that naively applied sampling can produce a huge bias in content statistics (i.e., the average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which estimate content characteristics using information available in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.



References

  1. Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703

  2. Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455

  3. Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109

  4. Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632

  5. Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28


  6. Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100

  7. Boyd S et al (2004) Fastest mixing Markov chain on a graph. SIAM Rev 46(4):667–689


  8. Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262

  9. Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264

  10. Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. Am Stat 49(4):327–335


  11. Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243

  12. Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336

  13. Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905


  14. Gjoka M et al (2010) Walking in Facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506

  15. Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223


  16. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109


  17. Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34


  18. Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. JASA 47(260):663–685


  19. Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597

  20. Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716

  21. Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758


  22. Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606

  23. Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292

  24. Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp 25–30

  25. Kurant M et al (2010) On the bias of BFS (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9

  26. Kurant M et al (2011) Towards unbiased BFS sampling. JSAC 29(9):1799–1809


  27. Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686

  28. Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890

  29. Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694

  30. Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46


  31. Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890

  32. Metropolis N et al (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092


  33. Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42

  34. Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403

  35. Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025


  36. Omidi S et al (2009) Moda: an efficient algorithm for network motif discovery in biological networks. GGS 84(5):385–395


  37. Pavan A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881

  38. Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705

  39. Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6

  40. Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403

  41. Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700

  42. Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239


  43. Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307


  44. Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834

  45. Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390


  46. Suh B et al (2010) Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In: SocialCom, pp 177–184

  47. Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846

  48. Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27


  49. Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061

  50. Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359


  51. Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026


  52. Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6

  53. Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the Twitter stream. TWEB 9(3):12:1–12:33


  54. Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55


  55. Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36



Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied of the ARL, or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), Shenzhen Basic Research Grant (JCYJ20160229195940462).

Author information


Corresponding author

Correspondence to Junzhou Zhao.

Appendix


MLE method

We present the MLE of \({\varvec{\omega }}\) only for the graph sampling method UNI, because it is not easy to derive the MLE of \({\varvec{\omega }}\) for other graph sampling methods such as RW, MHRW, and FS. Suppose that the graph size \(|V|\) is known (it can be estimated by the sampling methods proposed in [22]) and that \(n < |V|\) vertices are sampled, so that each copy of \(\mathbf {c}\) is sampled with the same probability \(p=\frac{n}{|V|}\). For simplicity, we assume that content is distributed over the network uniformly at random. Let M be the maximum number of copies that any content has. Let \(P_{i,j}\) denote the probability that i copies are sampled for content that has j copies in total, given that at least one of its copies is sampled, where \( 1 \le i \le j \le M\). Letting \(q=1-p\), we have \(P_{i,j}=\frac{\left( {\begin{array}{c}j\\ i\end{array}}\right) p^i q^{j-i}}{1 - q^j}\), i.e., a binomial distribution conditioned on \(i \ge 1\). We compute the MLE of \({\varvec{\omega }}\) from the sampled content copies for the following two cases:

Case 1: The content label under study is the number of copies that content has. For randomly sampled content, let \(\alpha _i\) (\(1\le i\le M\)) be the probability that it has i copies sampled. Among sampled content, let \(x_i\) be the fraction of content with i copies sampled. We have \(\mathbb {E}(x_i) = \alpha _i\). Thus, \(x_i\) is an unbiased estimate of \(\alpha _i\). Next, we present a method to estimate \({\varvec{\omega }}\) based on the relationship between \(\alpha _i\) and \({\varvec{\omega }}\). The likelihood function of \(\alpha _i\) is

$$\begin{aligned} \alpha _{i}=\sum _{j=i}^{M} \omega _j P_{i,j}. \end{aligned} \tag{5}$$

This is similar to packet sampling-based flow size distribution estimation studied in [12], where each packet is sampled with probability p. Here a flow refers to a group of packets with the same source and destination, and the flow size is the number of packets that it contains. In our context, content corresponds to a flow, and its copies correspond to packets in the flow. Therefore, we can develop a maximum likelihood estimate \(\hat{\omega }_k^{\text {MLE}}\) of \(\omega _k\) (\(1\le k\le M\)) similar to the method proposed in [12].
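The relationship in (5) and a likelihood maximization over \({\varvec{\omega }}\) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [12] or necessarily the one used in our experiments: it assumes an expectation-maximization (EM) iteration, a standard way to fit mixing weights when the component distributions \(P_{i,j}\) are known, and it takes the model in (5) exactly as stated. The function names (sample_prob, forward, log_likelihood, em_estimate), the uniform starting point, and the fixed iteration count are all hypothetical choices made for the example.

```python
from math import comb, log


def sample_prob(i, j, p):
    """P_{i,j}: probability that i of j copies are sampled, given at least one is."""
    if not 1 <= i <= j:
        return 0.0
    q = 1.0 - p
    return comb(j, i) * p**i * q**(j - i) / (1.0 - q**j)


def forward(omega, p):
    """Equation (5): alpha_i = sum_{j=i}^{M} omega_j * P_{i,j}; omega[j-1] holds omega_j."""
    M = len(omega)
    return [
        sum(omega[j - 1] * sample_prob(i, j, p) for j in range(i, M + 1))
        for i in range(1, M + 1)
    ]


def log_likelihood(x, omega, p):
    """Per-content log-likelihood sum_i x_i * log(alpha_i) of the observed fractions x."""
    alpha = forward(omega, p)
    return sum(xi * log(ai) for xi, ai in zip(x, alpha) if xi > 0)


def em_estimate(x, p, iters=500):
    """EM for the mixing weights omega (assumed approach), from a uniform start.

    x[i-1] is the observed fraction of sampled content with i copies sampled,
    and p must lie strictly between 0 and 1.
    """
    M = len(x)
    omega = [1.0 / M] * M
    for _ in range(iters):
        alpha = forward(omega, p)
        updated = [0.0] * M
        for i in range(1, M + 1):
            if x[i - 1] == 0:
                continue
            for j in range(i, M + 1):
                # Posterior weight of "j copies in total" given i copies sampled.
                resp = omega[j - 1] * sample_prob(i, j, p) / alpha[i - 1]
                updated[j - 1] += x[i - 1] * resp
        omega = updated
    return omega
```

In use, p would be set to \(n/|V|\) and x to the fractions \(x_i\) computed from the crawl; em_estimate(x, p) then returns a list whose (k-1)-th entry plays the role of \(\hat{\omega }_k^{\text {MLE}}\). Each EM step keeps the weights non-negative and summing to one and does not decrease the likelihood, but the number of iterations needed for a given accuracy depends on p and M and is not addressed by this sketch.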

Case 2: The content label under study is independent of the number of copies and is available in each content copy, i.e., it is not a latent property such as the number of copies that content has. In this case, we use the following approach to derive the MLE. Define \(\beta _{k,j}\) (\(0\le k\le K\), \(1\le j\le M\)) as the fraction of content with label \(l_k\) that has j copies, i.e., the number of content items with label \(l_k\) and j copies divided by the number of content items with label \(l_k\). For sampled content, let \(\alpha _{k,i}\) (\(1\le i\le M\)) be the probability that its label is \(l_k\) and that it has i copies sampled. Then, the likelihood function of \(\alpha _{k,i}\) is

$$\begin{aligned} \alpha _{k,i}=\sum _{j=i}^M \beta _{k,j} P_{i,j}. \end{aligned}$$

\(\alpha _{k,i}\) can be estimated based on sampled content copies. That is, among sampled content, let \(x_{k,i}\) be the fraction of content with label \(l_k\) that has i copies sampled. We have \(\mathbb {E}(x_{k,i}) = \alpha _{k,i}\). Therefore, \(x_{k,i}\) is an unbiased estimate of \(\alpha _{k,i}\). Similar to (5), we then develop a maximum likelihood estimate \(\hat{\beta }_{k,j}\) of \(\beta _{k,j}\), \(1\le j\le M\). Since

$$\begin{aligned} \alpha _k=\omega _k \sum _{i=1}^M \sum _{j=i}^M \beta _{k,j} P_{i,j}, \end{aligned}$$

we have the following estimator of \(\omega _k\)

$$\begin{aligned} \hat{\omega }_k^{\text {MLE}}=\frac{\hat{\alpha }_k}{S^{\text {MLE}}\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}, \quad 0\le k\le K, \end{aligned}$$

where \(\hat{\alpha }_k\) is the fraction of sampled content with label \(l_k\), and

$$\begin{aligned} S^{\text {MLE}}=\sum _{k=0}^K \frac{\hat{\alpha }_k}{\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}. \end{aligned}$$


About this article


Cite this article

Wang, P., Zhao, J., Lui, J.C.S. et al. Fast crawling methods of exploring content distributed over large graphs. Knowl Inf Syst 59, 67–92 (2019). https://doi.org/10.1007/s10115-018-1178-x


  • DOI: https://doi.org/10.1007/s10115-018-1178-x

Keywords

  • Crawling
  • Online social networks
  • Sampling
  • Random walks