Fast crawling methods of exploring content distributed over large graphs

Wang, Pinghui; Zhao, Junzhou; Lui, John C. S.; Towsley, Don; Guan, Xiaohong

doi:10.1007/s10115-018-1178-x

Regular Paper
Published: 15 March 2018

Volume 59, pages 67–92, (2019)
Knowledge and Information Systems

Abstract

Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.

References

Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703
Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455
Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109
Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632
Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28
Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100
Boyd S et al (2004) Fastest mixing Markov chain on a graph. SIAM Rev 46(4):667–689
Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262
Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264
Chib S, Greenberg E (1995) Understanding the metropolis-hastings algorithm. Am Stat 49(4):327–335
Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243
Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336
Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905
Gjoka M et al (2010) Walking in facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506
Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34
Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. JASA 47(260):663–685
Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597
Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716
Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758
Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606
Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292
Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp 25–30
Kurant M et al (2010) On the bias of bfs (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9
Kurant M et al (2011) Towards unbiased bfs sampling. JSAC 29(9):1799–1809
Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686
Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890
Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694
Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46
Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890
Metropolis N et al (1953) Equations of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42
Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403
Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025
Omidi S et al (2009) Moda: an efficient algorithm for network Motif discovery in biological networks. GGS 84(5):385–395
Pavan A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881
Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705
Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6
Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403
Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700
Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239
Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307
Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834
Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390
Suh B et al (2010) Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In: SocialCom, pp 177–184
Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846
Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27
Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061
Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359
Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026
Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6
Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the twitter stream. TWEB 9(3):12:1–12:33
Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55
Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by the National Natural Science Foundation of China (61603290, 61602371, U1301254), the Ministry of Education and China Mobile Research Fund (MCM20160311), the China Postdoctoral Science Foundation (2015M582663), the Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and the Shenzhen Basic Research Grant (JCYJ20160229195940462).

Author information

Authors and Affiliations

MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China
Pinghui Wang & Xiaohong Guan
Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China
Pinghui Wang
Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Junzhou Zhao
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong
John C. S. Lui
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA, US
Don Towsley
Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China
Xiaohong Guan

Corresponding author

Correspondence to Junzhou Zhao.

Appendix

1.1 MLE method

We present the MLE of ${\varvec{\omega }}$ only for the graph sampling method UNI, because it is not easy to derive the MLE of ${\varvec{\omega }}$ for other graph sampling methods such as RW, MHRW, and FS. Suppose that the graph size $|V|$ is known (it can be estimated by the sampling methods proposed in [22]) and that $n < |V|$ vertices are sampled, so that each copy of $\mathbf {c}$ is sampled with the same probability $p=\frac{n}{|V|}$. For simplicity, we assume that content is distributed over the network uniformly at random. Let $M$ be the maximum number of copies that any content has. Denote by $P_{i,j}$ the probability of sampling $i$ copies of content that has $j$ copies in total, where $1 \le i \le j \le M$. Letting $q=1-p$, we have $P_{i,j}=\frac{\binom{j}{i} p^i q^{j-i}}{1 - q^j}$. The denominator $1-q^j$ conditions on at least one copy being sampled, because content with no sampled copy is never observed; for example, with $p=0.5$ and $j=2$, $P_{1,2}=2/3$ and $P_{2,2}=1/3$. We compute the MLE of ${\varvec{\omega }}$ from sampled content copies with respect to the following two cases:

Case 1: the content label under study is the number of copies associated with content. For randomly sampled content, let $\alpha _i$ ($1\le i\le M$) be the probability that it has $i$ copies sampled. Among sampled content, let $x_i$ be the fraction of content with $i$ copies sampled. We have $\mathbb {E}(x_i) = \alpha _i$. Thus, $x_i$ is an unbiased estimate of $\alpha _i$. Next, we present a method to estimate ${\varvec{\omega }}$ based on the relationship between $\alpha _i$ and ${\varvec{\omega }}$. The likelihood function of $\alpha _i$ is

$$\begin{aligned} \alpha _{i}=\sum _{j=i}^{M} \omega _j P_{i,j}. \end{aligned} \qquad (5)$$

This is similar to packet sampling-based flow size distribution estimation studied in [12], where each packet is sampled with probability p. Here a flow refers to a group of packets with the same source and destination, and the flow size is the number of packets that it contains. In our context, content corresponds to a flow, and its copies correspond to packets in the flow. Therefore, we can develop a maximum likelihood estimate $\hat{\omega }_k^{\text {MLE}}$ of $\omega _k$ ($1\le k\le M$) similar to the method proposed in [12].
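As a rough illustration of Case 1, the sketch below tabulates $P_{i,j}$, evaluates (5), and recovers ${\varvec{\omega }}$ from the sampled fractions $x_i$ with a standard expectation-maximization iteration for mixture weights. The EM step is a swapped-in textbook technique and may differ in detail from the procedure of [12]; the function names `truncated_binomial`, `forward_alpha`, and `em_omega`, the iteration count, and the uniform starting point are all assumptions made for this example.

```python
from math import comb


def truncated_binomial(p, M):
    """Return P with P[i][j] = C(j,i) p^i q^(j-i) / (1 - q^j) for 1 <= i <= j <= M.

    Index 0 is unused so that indices match the notation in the text.
    """
    q = 1.0 - p
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    for j in range(1, M + 1):
        norm = 1.0 - q ** j  # probability that at least one of j copies is sampled
        for i in range(1, j + 1):
            P[i][j] = comb(j, i) * p ** i * q ** (j - i) / norm
    return P


def forward_alpha(omega, P):
    """Evaluate (5): alpha_i = sum_{j=i}^{M} omega_j * P[i][j]; omega[0] is unused."""
    M = len(omega) - 1
    return [0.0] + [
        sum(omega[j] * P[i][j] for j in range(i, M + 1)) for i in range(1, M + 1)
    ]


def em_omega(x, P, iters=2000):
    """Estimate omega from observed fractions x[1..M] by EM (illustrative choice).

    E-step: split each observed fraction x[i] across copy counts j >= i in
    proportion to omega[j] * P[i][j]. M-step: renormalize the accumulated mass.
    """
    M = len(x) - 1
    omega = [0.0] + [1.0 / M] * M  # uniform starting point (assumption)
    for _ in range(iters):
        new = [0.0] * (M + 1)
        for i in range(1, M + 1):
            if x[i] == 0.0:
                continue
            denom = sum(omega[j] * P[i][j] for j in range(i, M + 1))
            for j in range(i, M + 1):
                new[j] += x[i] * omega[j] * P[i][j] / denom
        total = sum(new)
        omega = [w / total for w in new]
    return omega
```

For instance, with $p=0.5$, $M=3$, and ${\varvec{\omega }}=(0.5, 0.3, 0.2)$, feeding the exact $\alpha _i$ from `forward_alpha` back into `em_omega` should return values close to the original ${\varvec{\omega }}$; with noisy $x_i$ from a small sample the estimate will be correspondingly noisier.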

Case 2: the content label under study is independent of the number of copies and is available in each content copy, that is, it is not a latent property such as the number of copies that content has. In this case, we use the following approach to derive the MLE. Define $\beta _{k,j}$ ($0\le k\le K$, $1\le j\le M$) as the fraction of content with label $l_k$ that has $j$ copies. For sampled content, let $\alpha _{k,i}$ ($1\le i\le M$) be the probability that its content label is $l_k$ and it has $i$ copies sampled. Then, the likelihood function of $\alpha _{k,i}$ is

$$\begin{aligned} \alpha _{k,i}=\sum _{j=i}^M \beta _{k,j} P_{i,j}. \end{aligned}$$

$\alpha _{k,i}$ can be estimated based on sampled content copies. That is, among sampled content, let $x_{k,i}$ be the fraction of content with label $l_k$ that has i copies sampled. We have $\mathbb {E}(x_{k,i}) = \alpha _{k,i}$. Therefore, $x_{k,i}$ is an unbiased estimate of $\alpha _{k,i}$. Similar to (5), we then develop a maximum likelihood estimate $\hat{\beta }_{k,j}$ of $\beta _{k,j}$, $1\le j\le M$. Since

$$\begin{aligned} \alpha _k=\omega _k \sum _{i=1}^M \sum _{j=i}^M \beta _{k,j} P_{i,j}, \end{aligned}$$

we have the following estimator of $\omega _k$

$$\begin{aligned} \hat{\omega }_k^{\text {MLE}}=\frac{\hat{\alpha }_k}{S^{\text {MLE}}\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}, \quad 0\le k\le K, \end{aligned}$$

where $\hat{\alpha }_k$ is the fraction of sampled content with label $l_k$, and

$$\begin{aligned} S^{\text {MLE}}=\sum _{k=0}^K \frac{\hat{\alpha }_k}{\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}. \end{aligned}$$
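The Case 2 estimator can be sketched as follows, transcribing the two formulas above as printed. The function name `omega_mle_case2` and the list-based layout (index 0 unused for copy counts, labels indexed from $k=0$) are assumptions made for this example, and `P` is expected to be a precomputed table of $P_{i,j}$ values.

```python
def omega_mle_case2(alpha_hat, beta_hat, P):
    """Compute omega_hat_k = alpha_hat_k / (S * D_k) for k = 0..K.

    alpha_hat[k]   : fraction of sampled content with label l_k.
    beta_hat[k][j] : estimated fraction of label-l_k content with j copies
                     (j = 1..M, index 0 unused).
    P[i][j]        : probability table P_{i,j} (indices 1..M, index 0 unused).
    D_k            : sum_{i=1}^{M} sum_{j=i}^{M} beta_hat[k][j] * P[i][j].
    """
    M = len(P) - 1
    ratios = []
    for k, a in enumerate(alpha_hat):
        d_k = sum(
            beta_hat[k][j] * P[i][j]
            for i in range(1, M + 1)
            for j in range(i, M + 1)
        )
        ratios.append(a / d_k)
    s_mle = sum(ratios)  # normalizing constant S^MLE
    return [r / s_mle for r in ratios]
```

By construction the returned values sum to one. One property of this sketch worth knowing: if `P` holds the zero-truncated $P_{i,j}$ defined earlier and each row of `beta_hat` sums to one, every $D_k$ evaluates to one, so the output coincides with `alpha_hat`; the correction only changes the result when the inputs depart from that normalization.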

About this article

Cite this article

Wang, P., Zhao, J., Lui, J.C.S. et al. Fast crawling methods of exploring content distributed over large graphs. Knowl Inf Syst 59, 67–92 (2019). https://doi.org/10.1007/s10115-018-1178-x

Received: 02 June 2017
Revised: 27 February 2018
Accepted: 09 March 2018
Published: 15 March 2018
Issue Date: 04 April 2019
DOI: https://doi.org/10.1007/s10115-018-1178-x
