Abstract
Despite recent efforts to estimate the topological characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Because of the large scale of these networks and the limited query rate imposed by network service providers, exhaustively crawling and enumerating the content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that naively applied sampling can produce a huge bias in content statistics (i.e., the average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which estimate content characteristics using information available in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate from meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
References
Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703
Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455
Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109
Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632
Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28
Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100
Boyd S et al (2004) Fastest mixing Markov chain on a graph. SIAM Rev 46(4):667–689
Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262
Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264
Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. Am Stat 49(4):327–335
Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243
Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336
Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905
Gjoka M et al (2010) Walking in facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506
Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34
Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. JASA 47(260):663–685
Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597
Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716
Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758
Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606
Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292
Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp 25–30
Kurant M et al (2010) On the bias of bfs (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9
Kurant M et al (2011) Towards unbiased bfs sampling. JSAC 29(9):1799–1809
Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686
Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890
Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694
Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46
Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890
Metropolis N et al (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42
Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403
Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025
Omidi S et al (2009) Moda: an efficient algorithm for network Motif discovery in biological networks. GGS 84(5):385–395
Pavan A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881
Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705
Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6
Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403
Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700
Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239
Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307
Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834
Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390
Suh B et al (2010) Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In: SocialCom, pp 177–184
Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846
Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27
Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061
Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359
Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026
Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6
Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the twitter stream. TWEB 9(3):12:1–12:33
Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55
Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36
Acknowledgements
The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and Shenzhen Basic Research Grant (JCYJ20160229195940462).
Appendix
1.1 MLE method
We present the MLE of \({\varvec{\omega }}\) only for the graph sampling method UNI, because the MLE of \({\varvec{\omega }}\) is not easy to derive for other graph sampling methods such as RW, MHRW, and FS. Suppose that the graph size \(|V|\) is known (it can be estimated by the sampling methods proposed in [22]) and that \(n < |V|\) vertices are sampled, so that each copy of \(\mathbf {c}\) is sampled with the same probability \(p=\frac{n}{|V|}\). For simplicity, we assume that content is distributed over the network uniformly at random. Let M be the maximum number of copies that any content has. Denote by \(P_{i,j}\) the probability that i copies are sampled for content that has j copies in total, given that at least one of its copies is sampled, where \( 1 \le i \le j \le M\). With \(q=1-p\), we have \(P_{i,j}=\frac{\left( {\begin{array}{c}j\\ i\end{array}}\right) p^i q^{j-i}}{1 - q^j}\). We compute the MLE of \({\varvec{\omega }}\) from sampled content copies with respect to the following two cases:
Case 1: The content label under study is the number of copies associated with content. For randomly sampled content, let \(\alpha _i\) (\(1\le i\le M\)) be the probability that it has i copies sampled. Among sampled content, let \(x_i\) be the fraction of content with i copies sampled. We have \(\mathbb {E}(x_i) = \alpha _i\); thus, \(x_i\) is an unbiased estimate of \(\alpha _i\). Next, we present a method to estimate \({\varvec{\omega }}\) based on the relationship between \(\alpha _i\) and \({\varvec{\omega }}\). The likelihood function of \(\alpha _i\) is
This is similar to packet sampling-based flow size distribution estimation studied in [12], where each packet is sampled with probability p. Here a flow refers to a group of packets with the same source and destination, and the flow size is the number of packets that it contains. In our context, content corresponds to a flow, and its copies correspond to packets in the flow. Therefore, we can develop a maximum likelihood estimate \(\hat{\omega }_k^{\text {MLE}}\) of \(\omega _k\) (\(1\le k\le M\)) similar to the method proposed in [12].
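The estimation step in Case 1 can be illustrated with a minimal sketch. It assumes, as Case 1 suggests, that \(\omega _j\) is the fraction of content with j copies, and it uses a plain expectation-maximization (EM) iteration over the truncated binomial \(P_{i,j}\) defined above. It is not the authors' implementation and not necessarily the exact procedure of [12]; the function names `trunc_binom`, `expected_alpha`, and `em_copy_distribution` are hypothetical. Because content is observed only when at least one of its copies is sampled, the copy-count distribution among sampled content is \(\theta _j \propto \omega _j (1-q^j)\); the sketch fits \(\theta \) to the observed fractions \(x_i\) and then removes this size bias.

```python
from math import comb


def trunc_binom(i, j, p):
    """P_{i,j}: probability that i of j copies are sampled, given at least one is."""
    q = 1.0 - p
    return comb(j, i) * p**i * q**(j - i) / (1.0 - q**j)


def expected_alpha(omega, p):
    """alpha_i implied by omega (assumed: omega[j-1] = fraction of content with j copies)."""
    M = len(omega)
    q = 1.0 - p
    # Probability that a randomly chosen content has at least one copy sampled.
    seen = sum(omega[j - 1] * (1.0 - q**j) for j in range(1, M + 1))
    return [
        sum(omega[j - 1] * comb(j, i) * p**i * q**(j - i) for j in range(i, M + 1)) / seen
        for i in range(1, M + 1)
    ]


def em_copy_distribution(x, p, iters=5000):
    """Estimate omega from x, where x[i-1] is the fraction (or count) of
    sampled content that has i copies sampled."""
    M = len(x)
    total = float(sum(x))
    q = 1.0 - p
    theta = [1.0 / M] * M  # copy-count distribution among *sampled* content
    for _ in range(iters):
        new_theta = [0.0] * M
        for i in range(1, M + 1):
            if x[i - 1] == 0:
                continue
            # E-step: posterior weight of each true copy count j >= i given i sampled copies.
            w = [theta[j - 1] * trunc_binom(i, j, p) for j in range(i, M + 1)]
            s = sum(w)
            if s == 0.0:
                continue
            for k, j in enumerate(range(i, M + 1)):
                new_theta[j - 1] += x[i - 1] * w[k] / s
        # M-step: renormalize the expected assignments.
        theta = [t / total for t in new_theta]
    # Undo the size bias: content with more copies is more likely to be observed.
    unnorm = [theta[j - 1] / (1.0 - q**j) for j in range(1, M + 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

For example, calling `em_copy_distribution(expected_alpha([0.5, 0.3, 0.2], 0.8), 0.8)` feeds the sketch noise-free fractions, which is a convenient sanity check. With real samples, `x` would hold the observed fractions \(x_i\), and the accuracy of the estimate depends on the sampling probability p.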
Case 2: The content label under study is independent of the number of copies and is available in each content copy; that is, it is not a latent property such as the number of copies that content has. In this case, we use the following approach to derive the MLE. Define \(\beta _{k,j}\) (\(0\le k\le K\), \(1\le j\le M\)) as the fraction of content with label \(l_k\) that has j copies. For sampled content, let \(\alpha _{k,i}\) (\(1\le i\le M\)) be the probability that its content label is \(l_k\) and it has i copies sampled. Then, the likelihood function of \(\alpha _{k,i}\) is
\(\alpha _{k,i}\) can be estimated based on sampled content copies. That is, among sampled content, let \(x_{k,i}\) be the fraction of content with label \(l_k\) that has i copies sampled. We have \(\mathbb {E}(x_{k,i}) = \alpha _{k,i}\). Therefore, \(x_{k,i}\) is an unbiased estimate of \(\alpha _{k,i}\). Similar to (5), we then develop a maximum likelihood estimate \(\hat{\beta }_{k,j}\) of \(\beta _{k,j}\), \(1\le j\le M\). Since
we have the following estimator of \(\omega _k\)
where \(\hat{\alpha }_k\) is the fraction of sampled content with label \(l_k\), and
Cite this article
Wang, P., Zhao, J., Lui, J.C.S. et al. Fast crawling methods of exploring content distributed over large graphs. Knowl Inf Syst 59, 67–92 (2019). https://doi.org/10.1007/s10115-018-1178-x