Practical characterization of large networks using neighborhood information

Abstract

Characterizing large complex networks such as online social networks through node querying is a challenging task. Network service providers often impose severe constraints on the query rate, which limits the sample to a small fraction of the network of interest. Various ad hoc subgraph sampling methods have been proposed, but many of them give biased estimates and offer no theoretical guarantees on accuracy. In this work, we focus on developing sampling methods for large networks where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or use the Web APIs available on most major large networks for graph sampling. We show that our sampling method is an unbiased estimator with provable convergence guarantees and that it is more accurate than state-of-the-art methods. We also explore methods to uncover shortest paths between a subset of nodes and to detect high-degree nodes by sampling only a small fraction of the network of interest. Our results demonstrate that utilizing neighborhood information yields methods that are two orders of magnitude faster than state-of-the-art methods.
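To make the bias concern concrete, the following is a minimal sketch of a generic re-weighted random walk on a toy graph. It is not the neighborhood-based estimator developed in the article. The graph, the `query` helper, and all function names are assumptions introduced purely for illustration. A simple random walk visits nodes in proportion to their degree, so averaging the observed degrees overestimates the mean degree; weighting each sample by the inverse of its degree corrects this.

```python
import random


def build_graph():
    """Hypothetical toy graph: a 10-node ring plus a hub (node 0) linked to every node."""
    n = 10
    adj = {v: set() for v in range(n)}
    for v in range(n):
        u = (v + 1) % n
        adj[v].add(u)
        adj[u].add(v)
    for v in range(1, n):
        adj[0].add(v)
        adj[v].add(0)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}


def query(graph, node):
    """Stand-in for one rate-limited API call: returns the neighbors of `node`."""
    return graph[node]


def random_walk_degrees(graph, start, steps, rng):
    """Run a simple random walk and record the degree of each visited node."""
    degrees = []
    current = start
    for _ in range(steps):
        nbrs = query(graph, current)
        degrees.append(len(nbrs))
        current = rng.choice(nbrs)
    return degrees


def reweighted_mean_degree(degrees):
    """Correct the degree-proportional sampling bias with inverse-degree weights."""
    return len(degrees) / sum(1.0 / d for d in degrees)


graph = build_graph()
rng = random.Random(0)  # fixed seed so the run is repeatable
degrees = random_walk_degrees(graph, start=0, steps=20000, rng=rng)

naive = sum(degrees) / len(degrees)          # biased toward high-degree nodes
corrected = reweighted_mean_degree(degrees)  # approaches the true mean degree
true_mean = sum(len(nbrs) for nbrs in graph.values()) / len(graph)

print(f"true={true_mean:.2f} naive={naive:.2f} corrected={corrected:.2f}")
```

On this toy graph the true mean degree is 3.4, while the uncorrected walk average tends toward roughly 4.5 because the hub is visited far more often than the other nodes.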

Notes

  1. http://www.facebook.com.

  2. http://www.weibo.com.

  3. http://www.quora.com.

  4. http://citeseerx.ist.psu.edu.

  5. In directed networks where querying a node retrieves the node's incoming and outgoing edges.

  6. http://scholar.google.com.

  7. http://www.flickr.com.

  8. http://www.xiami.com.

  9. http://citeseerx.ist.psu.edu/.

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by the National Natural Science Foundation of China (61603290, 61602371, U1301254), the Ministry of Education & China Mobile Research Fund (MCM20160311), the China Postdoctoral Science Foundation (2015M582663), the Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and the Shenzhen Basic Research Grant (JCYJ20160229195940462).

Author information

Corresponding author

Correspondence to Junzhou Zhao.

About this article

Cite this article

Wang, P., Zhao, J., Ribeiro, B. et al. Practical characterization of large networks using neighborhood information. Knowl Inf Syst 58, 701–728 (2019). https://doi.org/10.1007/s10115-018-1167-0

Keywords

  • Crawling
  • Graph sampling
  • Online social network
  • Random walk