# Sampling online social networks by random walk with indirect jumps

## Abstract

Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from the slow mixing problem when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem, but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without requiring UNI while retaining benefits similar to those of RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and together they form a better-connected *two-layered network structure*. This viewpoint brings an extra benefit to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques that exploit such a two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques.
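To make the role of UNI concrete, the following is a minimal sketch of a plain random walk with jumps on a toy graph. It is not the indirect-jump method proposed in this work. It assumes the common formulation in which a walker at a vertex of degree d jumps to a uniformly chosen vertex with probability w / (d + w) and otherwise moves to a random neighbor, so that the stationary probability of a vertex is proportional to d + w. The names `random_walk_with_jumps`, `estimate_mean`, `jump_weight`, and the toy graph itself are illustrative assumptions.

```python
import random


def random_walk_with_jumps(adj, steps, jump_weight, seed=0):
    """Return the vertices visited by a random walk with jumps.

    adj: dict mapping each vertex to a list of its neighbors (undirected).
    jump_weight: the jump parameter w; must be > 0 in this sketch.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    current = rng.choice(nodes)
    visited = []
    for _ in range(steps):
        visited.append(current)
        degree = len(adj[current])
        # Jump with probability w / (d + w). This is the step that needs
        # uniform vertex sampling (UNI) over the whole graph.
        if rng.random() < jump_weight / (degree + jump_weight):
            current = rng.choice(nodes)
        else:
            current = rng.choice(adj[current])
    return visited


def estimate_mean(visited, adj, jump_weight, f):
    """Estimate the average of f over all vertices from the walk.

    Samples are reweighted by 1 / (d + w) to undo the bias of the
    stationary distribution, which is proportional to d + w.
    """
    weights = [1.0 / (len(adj[v]) + jump_weight) for v in visited]
    total = sum(w * f(v) for w, v in zip(weights, visited))
    return total / sum(weights)


# Toy graph (assumed for illustration): a triangle and a star that are
# not connected to each other, so a walk without jumps would stay in
# whichever component it started in.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],          # triangle
    3: [4, 5, 6], 4: [3], 5: [3], 6: [3],     # star centered at 3
}

walk = random_walk_with_jumps(toy_graph, steps=20000, jump_weight=1.0)
mean_degree = estimate_mean(walk, toy_graph, 1.0, lambda v: len(toy_graph[v]))
true_mean_degree = sum(len(n) for n in toy_graph.values()) / len(toy_graph)
```

The jump line is the only place where the sampler must draw a vertex uniformly from the whole graph. When a network offers no such operation, this sampler cannot be run as written, which is the gap the two-layered approach is meant to fill.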

## Keywords

Graph sampling, Random walk, Markov chain, Estimation theory

## Notes

### Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1301254, 61603290, 61602371, 61772412), the Ministry of Education & China Mobile Research Fund (MCM20160311), the Natural Science Foundation of Jiangsu Province (SBK2014021758), the 111 International Collaboration Program of China, the Prospective Joint Research of Industry-Academia-Research Joint Innovation Funding of Jiangsu Province (BY2014074), the Shenzhen Basic Research Grant (JCYJ20160229195940462, JCYJ20170816100819428), the China Postdoctoral Science Foundation (2015M582663), and the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034). The work by John C. S. Lui was supported in part by GRF 14208816.
