Data Mining and Knowledge Discovery, Volume 33, Issue 1, pp 24–57

Sampling online social networks by random walk with indirect jumps

  • Junzhou Zhao
  • Pinghui Wang (corresponding author)
  • John C. S. Lui
  • Don Towsley
  • Xiaohong Guan

Abstract

Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from the slow mixing problem when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem, but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without requiring UNI while still enjoying benefits similar to those of RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and together they form a better-connected two-layered network structure. This new viewpoint brings extra benefits to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques that exploit this two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques.
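
The abstract names RWwJ without spelling it out. As a rough illustration only, the following Python sketch shows one commonly used formulation of a random walk with jumps, which is assumed here and not taken from the paper: at a vertex of degree d, the walk jumps to a uniformly chosen vertex with probability w / (d + w) and otherwise moves to a random neighbor, and samples are re-weighted by 1 / (d + w). The function names `rw_with_jumps` and `estimate_mean`, the jump weight `w`, and the toy graph are all hypothetical. It does not implement the indirect-jump techniques proposed in the paper.

```python
import random


def rw_with_jumps(adj, start, steps, w, rng):
    """Random walk with jumps (assumed formulation, see lead-in).

    At vertex v, jump to a uniformly random vertex with probability
    w / (deg(v) + w); otherwise step to a uniformly random neighbor.
    With w == 0 this is a plain random walk (no isolated vertices allowed).
    """
    vertices = list(adj)
    v = start
    samples = []
    for _ in range(steps):
        samples.append(v)
        deg = len(adj[v])
        if rng.random() < w / (deg + w):
            v = rng.choice(vertices)   # needs uniform vertex sampling (UNI)
        else:
            v = rng.choice(adj[v])     # ordinary random-walk step
    return samples


def estimate_mean(samples, adj, w, f):
    """Re-weighted estimate of the average of f over all vertices.

    The walk visits v with probability proportional to deg(v) + w,
    so each sample is weighted by 1 / (deg(v) + w).
    """
    weights = [1.0 / (len(adj[v]) + w) for v in samples]
    total = sum(wt * f(v) for wt, v in zip(weights, samples))
    return total / sum(weights)


# Toy graph with two disconnected parts: a 4-clique and a 4-vertex path.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2],
    4: [5], 5: [4, 6], 6: [5, 7], 7: [6],
}


def degree(v):
    return len(adj[v])


rng = random.Random(7)
plain = estimate_mean(rw_with_jumps(adj, 0, 20000, 0.0, rng), adj, 0.0, degree)
jumpy = estimate_mean(rw_with_jumps(adj, 0, 20000, 1.0, rng), adj, 1.0, degree)
print(plain, jumpy)
```

The true average degree of this toy graph is 18 / 8 = 2.25. The plain walk never leaves the clique it starts in, so its estimate stays at 3, while the walk with jumps reaches both parts and should land near 2.25. The jump step is exactly where UNI is needed, which is the requirement the paper sets out to avoid.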

Keywords

Graph sampling, Random walk, Markov chain, Estimation theory

Notes

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1301254, 61603290, 61602371, 61772412), the Ministry of Education & China Mobile Research Fund (MCM20160311), the Natural Science Foundation of Jiangsu Province (SBK2014021758), the 111 International Collaboration Program of China, the Prospective Joint Research of Industry-Academia-Research Joint Innovation Funding of Jiangsu Province (BY2014074), the Shenzhen Basic Research Grant (JCYJ20160229195940462, JCYJ20170816100819428), the China Postdoctoral Science Foundation (2015M582663), and the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034). The work by John C. S. Lui was supported in part by GRF 14208816.

Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Junzhou Zhao (1)
  • Pinghui Wang (2, corresponding author)
  • John C. S. Lui (1)
  • Don Towsley (3)
  • Xiaohong Guan (2)

  1. The Chinese University of Hong Kong, Hong Kong, China
  2. Xi’an Jiaotong University, Xi’an, China
  3. University of Massachusetts at Amherst, Amherst, USA
