## Abstract

Despite recent efforts to estimate topological characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Because these networks are large and network service providers impose a limited query rate, exhaustively crawling and enumerating the content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that naively applied sampling can produce a huge bias in content statistics (e.g., the average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which estimate content characteristics using information available in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate from meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.

## References

Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703

Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455

Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109

Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632

Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28

Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100

Boyd S et al (2004) Fastest mixing Markov chain on a graph. SIAM Rev 46(4):667–689

Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262

Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264

Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. Am Stat 49(4):327–335

Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243

Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336

Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905

Gjoka M et al (2010) Walking in facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506

Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223

Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109

Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34

Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. JASA 47(260):663–685

Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597

Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716

Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758

Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606

Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292

Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp 25–30

Kurant M et al (2010) On the bias of bfs (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9

Kurant M et al (2011) Towards unbiased bfs sampling. JSAC 29(9):1799–1809

Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686

Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890

Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694

Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46

Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890

Metropolis N et al (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092

Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42

Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403

Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025

Omidi S et al (2009) Moda: an efficient algorithm for network Motif discovery in biological networks. GGS 84(5):385–395

Pavan A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881

Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705

Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6

Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403

Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700

Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239

Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307

Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834

Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390

Suh B et al (2010) Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In: SocialCom, pp 177–184

Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846

Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27

Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061

Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359

Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026

Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6

Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the twitter stream. TWEB 9(3):12:1–12:33

Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55

Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36

## Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and Shenzhen Basic Research Grant (JCYJ20160229195940462).

## Appendix

### MLE method

We present the MLE of \({\varvec{\omega }}\) only for the graph sampling method UNI, because the MLE of \({\varvec{\omega }}\) is not easy to derive for other graph sampling methods such as RW, MHRW, and FS. Suppose that the graph size \(|V|\) is known (it can be estimated by the sampling methods proposed in [22]) and that \(n < |V|\) vertices are sampled, so that each copy of \(\mathbf {c}\) is sampled with the same probability \(p=\frac{n}{|V|}\). For simplicity, we assume that content is distributed over the network uniformly at random. Let *M* be the maximum number of copies that any content has. Denote by \(P_{i,j}\) the probability that *i* copies are sampled for content with *j* copies in total, given that at least one of its copies is sampled, where \(1 \le i \le j \le M\). Letting \(q=1-p\), we have \(P_{i,j}=\frac{\binom{j}{i} p^i q^{j-i}}{1 - q^j}\). We compute the MLE of \({\varvec{\omega }}\) from sampled content copies for the following two cases:
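Before turning to the two cases, the expression for \(P_{i,j}\) can be checked numerically. The following minimal sketch evaluates it as a zero-truncated binomial probability and sums it over \(i = 1, \ldots, j\) for a fixed *j*; the sum should equal one. The function name `p_sampled` and the values \(p = 0.1\), \(j = 5\) are illustrative assumptions, not taken from the paper.

```python
from math import comb


def p_sampled(i: int, j: int, p: float) -> float:
    """P_{i,j}: probability that exactly i of j copies are sampled,
    given that at least one copy is sampled (each copy kept w.p. p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 1 <= i <= j:
        raise ValueError("require 1 <= i <= j")
    q = 1.0 - p
    # Binomial term divided by the probability of seeing the content at all.
    return comb(j, i) * p**i * q ** (j - i) / (1.0 - q**j)


if __name__ == "__main__":
    p = 0.1  # illustrative sampling probability n / |V|
    total = sum(p_sampled(i, 5, p) for i in range(1, 6))
    print(f"sum over i of P(i, 5) = {total:.6f}")
```

The denominator \(1 - q^j\) is what distinguishes \(P_{i,j}\) from a plain binomial probability: content with no sampled copy is never observed, so the distribution is conditioned on \(i \ge 1\).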

*Case 1* The content label under study is the number of copies associated with content. For randomly sampled content, let \(\alpha _i\) (\(1\le i\le M\)) be the probability that it has *i* copies sampled. Among sampled content, let \(x_i\) be the fraction of content with *i* copies sampled. We have \(\mathbb {E}(x_i) = \alpha _i\); thus, \(x_i\) is an unbiased estimate of \(\alpha _i\). Next, we present a method to estimate \({\varvec{\omega }}\) based on the relationship between \(\alpha _i\) and \({\varvec{\omega }}\). The likelihood function of \(\alpha _i\) is

This is similar to packet sampling-based flow size distribution estimation studied in [12], where each packet is sampled with probability *p*. Here a flow refers to a group of packets with the same source and destination, and the flow size is the number of packets that it contains. In our context, content corresponds to a flow, and its copies correspond to packets in the flow. Therefore, we can develop a maximum likelihood estimate \(\hat{\omega }_k^{\text {MLE}}\) of \(\omega _k\) (\(1\le k\le M\)) similar to the method proposed in [12].
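The estimation step can be sketched with a generic expectation-maximization (EM) iteration. The sketch assumes that \(\omega _j\) is the fraction of content with *j* copies, so that content observed in the sample has *j* copies with probability \(\phi _j \propto \omega _j (1-q^j)\) and \(\alpha _i = \sum _{j \ge i} \phi _j P_{i,j}\). It is a standard EM for mixture weights with known components, offered as an illustration and not necessarily the exact procedure of [12]; the names `em_copy_distribution`, `counts`, and `phi` are hypothetical.

```python
from math import comb


def p_sampled(i, j, p):
    """P_{i,j}: i of j copies sampled, given that at least one is sampled."""
    q = 1.0 - p
    return comb(j, i) * p**i * q ** (j - i) / (1.0 - q**j)


def em_copy_distribution(counts, p, max_copies, iters=500):
    """Estimate omega_1..omega_M from counts[i-1], the number of sampled
    content items with i copies in the sample (hypothetical interface)."""
    M = max_copies
    if len(counts) > M:
        raise ValueError("counts has more entries than max_copies")
    q = 1.0 - p
    n = float(sum(counts))
    # P[i-1][j-1] = P_{i,j} for i <= j, and 0 otherwise.
    P = [[p_sampled(i, j, p) if i <= j else 0.0 for j in range(1, M + 1)]
         for i in range(1, M + 1)]
    phi = [1.0 / M] * M  # phi_j: share of *observed* content with j copies
    for _ in range(iters):
        new_phi = [0.0] * M
        for i, n_i in enumerate(counts):
            if n_i == 0:
                continue
            mix = sum(phi[j] * P[i][j] for j in range(M))
            for j in range(M):
                # E-step responsibility, accumulated for the M-step.
                new_phi[j] += n_i * phi[j] * P[i][j] / mix
        phi = [v / n for v in new_phi]
    # Undo the observation bias: content with j copies is seen w.p. 1 - q^j.
    weights = [phi[j] / (1.0 - q ** (j + 1)) for j in range(M)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Illustrative counts: 700 items seen once, 250 twice, 50 three times.
    print(em_copy_distribution([700, 250, 50], p=0.2, max_copies=5))
```

The final reweighting by \(1/(1-q^j)\) matters because content with few copies is less likely to appear in the sample at all, so the distribution among observed content overstates content with many copies.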

*Case 2* The content label under study is independent of the number of duplicates and is available in each content copy, i.e., it is not a latent property such as the number of copies that content has. In this case, we use the following approach to derive the MLE. Define \(\beta _{k,j}\) (\(0\le k\le K\), \(1\le j\le M\)) as the fraction of content with label \(l_k\) that has *j* copies. For sampled content, let \(\alpha _{k,i}\) (\(1\le i\le M\)) be the probability that its label is \(l_k\) and that it has *i* copies sampled. Then, the likelihood function of \(\alpha _{k,i}\) is

\(\alpha _{k,i}\) can be estimated based on sampled content copies. That is, among sampled content, let \(x_{k,i}\) be the fraction of content with label \(l_k\) that has *i* copies sampled. We have \(\mathbb {E}(x_{k,i}) = \alpha _{k,i}\). Therefore, \(x_{k,i}\) is an unbiased estimate of \(\alpha _{k,i}\). Similar to (5), we then develop a maximum likelihood estimate \(\hat{\beta }_{k,j}\) of \(\beta _{k,j}\), \(1\le j\le M\). Since

we have the following estimator of \(\omega _k\)

where \(\hat{\alpha }_k\) is the fraction of sampled content with label \(l_k\), and
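As a minimal sketch of how these quantities can be combined, assume that content with label \(l_k\) appears in the sample with probability \(\pi _k = \sum _{j=1}^{M} \beta _{k,j}(1-q^j)\), so that the fraction of sampled content with label \(l_k\) is proportional to \(\omega _k \pi _k\). Under that assumption, \(\omega _k\) is proportional to \(\hat{\alpha }_k / \hat{\pi }_k\), normalized over all labels. This form follows from the definitions above and is stated here as an assumption, not necessarily the authors' exact estimator; the function and argument names are hypothetical.

```python
def estimate_label_distribution(alpha_hat, beta_hat, p):
    """Reweight sampled label fractions by per-label observation probability.

    alpha_hat[k]   : fraction of sampled content with label l_k
    beta_hat[k][j-1]: estimated fraction of label-l_k content with j copies
    p              : probability that each copy is sampled
    """
    q = 1.0 - p
    weights = []
    for a_k, beta_k in zip(alpha_hat, beta_hat):
        # pi_k: chance that content with label l_k has at least one copy sampled.
        pi_k = sum(b * (1.0 - q**j) for j, b in enumerate(beta_k, start=1))
        weights.append(a_k / pi_k)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two labels: the first always has one copy, the second always has two.
    print(estimate_label_distribution([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], p=0.5))
```

Labels whose content tends to have few copies receive a larger correction, since such content is less likely to be observed.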

## About this article

### Cite this article

Wang, P., Zhao, J., Lui, J.C.S. *et al.* Fast crawling methods of exploring content distributed over large graphs. *Knowl Inf Syst* **59**, 67–92 (2019). https://doi.org/10.1007/s10115-018-1178-x

DOI: https://doi.org/10.1007/s10115-018-1178-x

### Keywords

- Crawling
- Online social networks
- Sampling
- Random walks