Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs

Shao, Yingxia; Huang, Shiyue; Li, Yawen; Miao, Xupeng; Cui, Bin; Chen, Lei

doi:10.1007/s00778-021-00669-2

Regular Paper
Published: 07 May 2021

Volume 30, pages 769–797, (2021)
The VLDB Journal

Yingxia Shao ORCID: orcid.org/0000-0002-8559-2628¹,
Shiyue Huang³,
Yawen Li²,
Xupeng Miao³,
Bin Cui³ &
Lei Chen⁴

Abstract

Second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measures and community detection, use it to capture higher-order patterns in the graph, thus improving model accuracy. However, the memory explosion problem of this technique prevents it from being applied to large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., the alias method) for the second-order random walk may take up 1796 TB of memory. Such high memory consumption comes from the memory-unaware strategies used for node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling from the skewed probability distributions that exist in natural graphs. Second, to make the second-order random walk efficient within arbitrary memory budgets, we propose a novel memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign a suitable node sampling method to each node or edge in the graph within a memory budget while minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces for users to define new second-order random walk models easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency while reducing the memory cost by 90%.

Notes

http://www.columbia.edu/~ks20/4703-Sigman/4703-07-Notes-ARM.pdf
Note that the coefficient c is incurred by finding the edge id between previous node u and current node v to access the group information.
https://www.openmp.org/
http://snap.stanford.edu/data/soc-LiveJournal1.html
https://an.kaist.ac.kr/traces/WWW2010.html
http://law.di.unimi.it/webdata/uk-2007-05/
http://socialcomputing.asu.edu/datasets
Note that the minimal memory of the rejection method differs from the one in our conference version, because we store the number of common neighbors of each edge in memory to quickly compute the exact bounding constant, which improves the efficiency of the rejection method over billion-edge graphs.
https://www.mindspore.cn/

Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1402600), NSFC (Nos. U1936104, 61902037, 61832001), CAAI-Huawei MindSpore Open Fund, Beijing Academy of Artificial Intelligence (BAAI), PKU-Baidu Fund 2019BD006, and the Fundamental Research Funds for the Central Universities 2020RC25. Lei Chen’s work is partially supported by National Key Research and Development Program of China Grant No. 2018AAA0101100, the Hong Kong RGC GRF Project 16202218, CRF Project C6030-18G, C1031-18G, C5026-18G, AOE Project AoE/E-603/18, Theme-based project TRS T41-603/20R, China NSFC No. 61729201, Guangdong Basic and Applied Basic Research Foundation 2019B151530001, Hong Kong ITC ITF grants ITS/044/18FX and ITS/470/18FX, Microsoft Research Asia Collaborative Research Grant, Didi-HKUST joint research lab project, and Wechat and Webank Research Grants.

Author information

Authors and Affiliations

School of Computer Science (National Pilot Software Engineering School) & Beijing Key Lab of Intelligent Telecommunications Software and Multimedia, BUPT, Beijing, China
Yingxia Shao
School of Economics and Management, BUPT, Beijing, China
Yawen Li
Department of Computer Science and Technology & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing, China
Shiyue Huang, Xupeng Miao & Bin Cui
Department of Computer Science and Engineering, HKUST Hong Kong, China
Lei Chen

Corresponding author

Correspondence to Yawen Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

The proof of proposition 1

Proof

Let G(V, E) be an unweighted graph, and let u and v be the previous node and the current node, respectively. Since the nodes in the same group have the same probability, we use $p_i$, $i=1..3$, to denote the e2e probability of a node in the $i$th group, and $p_i^g$, $i=1..3$, to denote the probability of the $i$th group.

According to the definition of groups, we have $|G_{I}|=1$, $|G_{II}|=\theta _{uv}$, $|G_{III}|=d_v-1-\theta _{uv}$. Therefore, $p_{i}^g=\sum _{1}^{|G_i|}p_i = |G_i|\,p_i$.

Based on the analysis in Section 5.1, the time cost of the rejection node sampler $T_r$ is $C_vcK$, and the time cost of the group-based node sampler $T_g$ is $(1+p_1^gc+p_2^gC_{v}^{g_2}c+ p_3^gC_{v}^{g_3}c)K$.

In the context of an unweighted graph, we have $C_v = d_v \max \{p_1, p_2, p_3\}$, $C_v^{g_2}=1$, $C_v^{g_3}=\frac{d_v}{d_v-\theta _{uv}-1}$.

The condition $T_r > T_g$ requires

$$\begin{aligned}&T_r-T_g = C_vcK-(1+p_1^gc+p_2^gC_{v}^{g_2}c+ p_3^{g}C_{v}^{g_3}c)K \\&\quad = (d_v \max _{i=1..3}\{p_i\}c - (1+p_1c+p_2\theta _{uv}c+ p_3d_vc))K \\&\quad = (d_v (\max _{i=1..3}\{p_i\}-p_3)c - (1+p_1c+p_2\theta _{uv}c))K \\&\quad > 0 \end{aligned}$$

Therefore, the above inequality holds when

$$\begin{aligned} d_v (\max _{i=1..3}\{p_i\}-p_3) - (p_1+p_2\theta _{uv}) > \frac{1}{c} \end{aligned}$$

is satisfied, which proves the proposition. $\square $
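As a quick numerical illustration of the proposition, the following minimal Python sketch evaluates $T_r$ and $T_g$ from the unweighted-graph formulas in the proof and compares the result with the closed-form condition. The function names, the helper `normalized_probs`, and the example values ($d_v=100$, $\theta_{uv}=10$, the group weights, $c=1$, $K=1000$) are hypothetical choices made for illustration only; they are not taken from the paper.

```python
def rejection_cost(d_v, p, c, K):
    """T_r = C_v * c * K, with C_v = d_v * max(p_1, p_2, p_3)."""
    return d_v * max(p) * c * K


def group_cost(d_v, theta_uv, p, c, K):
    """T_g = (1 + p1^g c + p2^g C^{g2} c + p3^g C^{g3} c) * K."""
    p1, p2, p3 = p
    g1 = p1                          # |G_I| = 1
    g2 = theta_uv * p2               # |G_II| = theta_uv
    g3 = (d_v - 1 - theta_uv) * p3   # |G_III| = d_v - 1 - theta_uv
    c_g2 = 1.0
    c_g3 = d_v / (d_v - theta_uv - 1)
    return (1 + g1 * c + g2 * c_g2 * c + g3 * c_g3 * c) * K


def proposition_condition(d_v, theta_uv, p, c):
    """Closed-form condition under which T_r > T_g."""
    p1, p2, p3 = p
    return d_v * (max(p) - p3) - (p1 + p2 * theta_uv) > 1 / c


def normalized_probs(d_v, theta_uv, weights):
    """Hypothetical helper: turn per-group weights into e2e probabilities."""
    w1, w2, w3 = weights
    total = w1 + theta_uv * w2 + (d_v - 1 - theta_uv) * w3
    return (w1 / total, w2 / total, w3 / total)


# Illustrative inputs (assumptions): one skewed and one uniform distribution.
skewed = normalized_probs(100, 10, (4.0, 1.0, 0.25))
uniform = normalized_probs(100, 10, (1.0, 1.0, 1.0))
for p in (skewed, uniform):
    group_is_faster = rejection_cost(100, p, 1.0, 1000) > group_cost(100, 10, p, 1.0, 1000)
    print(group_is_faster, proposition_condition(100, 10, p, 1.0))
```

In both cases the direct cost comparison and the closed-form condition agree: with these inputs the skewed distribution favors the group-based sampler, while the uniform one does not.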

Table 10 The distribution of types of node samplers and the concrete node samplers for the nodes with top-10 largest degrees when running NV(0.25, 4) with different greedy algorithms over Youtube. The values in parentheses are the average degree for the nodes having the same node sampler. N: Naive, R: Rejection, A: Alias

LP-domination analysis

In this section, we show that there is no LP-domination among the alias, rejection and naive sampling methods. Here, we give the proof with a common setting $b_f=4$, $b_i=4$, $c=1$.

Proof

Following the cost model in Table 2, to prove that there is no LP-domination among the three sampling methods, we need to show that $\frac{T_{r}-T_{n}}{M_{r}-M_{n}}-\frac{T_{a}-T_{r}}{M_{a}-M_{r}}\le 0$ holds.

$$\begin{aligned}&\frac{T_{r}-T_{n}}{M_{r}-M_{n}}-\frac{T_{a}-T_{r}}{M_{a}-M_{r}}\\&\quad =\frac{C_vcK-d_v(c+1)K}{(2b_f+b_i)d_v-M_{n}}-\frac{K-C_vcK}{(b_f+b_i)d^2_v-b_fd_v}\\&\quad =\frac{C_vK-2d_vK}{12d_v-M_{n}}-\frac{K-C_vK}{8d^2_v-4d_v} \\&\quad =K\frac{(C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n)}{(12d_v-M_n)(8d_v^2-4d_v)} \end{aligned}$$

Let $0<M_n=\frac{b_fd_{max}}{|V|}<b_f=4$ and $C_v\le d_v$. It is easy to see that $(12d_v-M_n)(8d_v^2-4d_v) > 0$ when $d_v \ge 1$. Then, we only need to bound $(C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n)$ as below:

$$\begin{aligned}&(C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n) \\&\quad \le -d_v(8d_v^2-4d_v)-(1-d_v)(12d_v-M_n) \qquad \text {(let } C_v=d_v\text {)}\\&\quad =-8d_v^3+16d_v^2-12d_v+M_n-d_vM_n \\&\quad < -8d_v^3+16d_v^2-12d_v+4 \qquad \text {(let } M_n=4 \text { and omit } -d_vM_n\text {)}\\&\quad \le 0 \qquad (d_v \ge 1). \end{aligned}$$

$\square $
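The bound above can also be checked numerically. The following minimal sketch evaluates the slope difference from the derivation under the same setting ($b_f=b_i=4$, $c=1$) over a grid of values. The grid itself (degrees up to 200, $C_v$ between 1 and $d_v$, a few values of $M_n$ in $(0, 4)$) and the function names are assumptions chosen for illustration; a grid check is a sanity test, not a substitute for the proof.

```python
def slope_gap(d_v, C_v, M_n, K=1.0):
    """(T_r - T_n)/(M_r - M_n) - (T_a - T_r)/(M_a - M_r) with b_f = b_i = 4, c = 1."""
    left = (C_v * K - 2 * d_v * K) / (12 * d_v - M_n)
    right = (K - C_v * K) / (8 * d_v ** 2 - 4 * d_v)
    return left - right


def worst_gap(max_degree=200, steps=20):
    """Largest slope gap found on an illustrative grid of (d_v, C_v, M_n)."""
    worst = float("-inf")
    for d_v in range(1, max_degree + 1):
        for i in range(steps + 1):
            # Sweep C_v over [1, d_v], respecting C_v <= d_v.
            C_v = 1 + (d_v - 1) * i / steps
            for M_n in (0.01, 1.0, 2.0, 3.99):
                worst = max(worst, slope_gap(d_v, C_v, M_n))
    return worst


print(worst_gap() <= 0)
```

On this grid the largest gap stays non-positive, which is consistent with the inequality required for no LP-domination.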

Analysis about the results of Deg-inc on Youtube

In Fig. 8a, b, when the memory budget is larger than 7.5 GB, Deg-inc performs similarly to LP-std and LP-est on Youtube. To analyze the reasons behind this result, we take Fig. 8a as an example and profile the distribution of the types of node samplers. We also give the concrete node samplers for the nodes with the top-10 largest degrees. The statistics are reported in Table 10. From the table, we see that when the memory budget is 7.5 GB, the distributions of node sampler types are almost the same for LP-std and Deg-inc. After checking the complete node sampler assignment, we find that only two nodes have different node samplers. Recall that Deg-inc processes the nodes with small degrees first. Due to the sparsity of Youtube, even when all the nodes with small degrees are assigned the alias method, enough memory budget is left to allow nodes with large degrees to use the rejection method. But when the memory budget is 2.5 GB, nodes with large degrees are assigned the naive node sampler by Deg-inc, resulting in poor efficiency. Unlike Deg-inc, Deg-dec is able to assign the alias method or the rejection method to nodes with large degrees regardless of whether the memory budget is 2.5 GB or 7.5 GB. However, Deg-dec always processes the largest nodes first, thus consuming a large share of the memory budget. As a result, Deg-dec leaves many other nodes with the naive method, and the average degree of the naive method for Deg-dec in Table 10 implicitly reflects this node sampler assignment.
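The behavior described above can be reproduced on a toy example. The sketch below is one plausible reading of a degree-ordered greedy assignment, not the authors' implementation: each node, visited in degree order, gets the fastest sampler whose memory cost still fits the remaining budget. The per-node memory costs ($8d_v^2$ for alias, $12d_v$ for rejection, zero for naive) are simplifying assumptions loosely based on the terms in the LP-domination derivation, and the degrees and budgets are made up.

```python
def greedy_assign(degrees, budget, largest_first=False):
    """Assign 'A' (alias), 'R' (rejection) or 'N' (naive) to each node.

    largest_first=False mimics a Deg-inc style order, True a Deg-dec style order.
    Memory costs are illustrative assumptions, not the paper's cost model.
    """
    order = sorted(range(len(degrees)), key=lambda i: degrees[i],
                   reverse=largest_first)
    assignment = ["N"] * len(degrees)   # naive needs no extra memory here
    remaining = budget
    for i in order:
        d = degrees[i]
        # Try the faster sampler first, then fall back to the cheaper one.
        for label, cost in (("A", 8 * d * d), ("R", 12 * d)):
            if cost <= remaining:
                assignment[i] = label
                remaining -= cost
                break
    return assignment


degrees = [2, 2, 3, 50]   # three small nodes and one hub (made-up values)
print(greedy_assign(degrees, 800))                      # ['A', 'A', 'A', 'R']
print(greedy_assign(degrees, 700))                      # ['A', 'A', 'A', 'N']
print(greedy_assign(degrees, 700, largest_first=True))  # ['R', 'N', 'A', 'R']
```

With the larger budget, the small-degree-first order still leaves room for the hub to use rejection. With the smaller budget it pushes the hub to the naive sampler, whereas the largest-first order serves the hub but leaves less memory for the remaining nodes.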

About this article

Cite this article

Shao, Y., Huang, S., Li, Y. et al. Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs. The VLDB Journal 30, 769–797 (2021). https://doi.org/10.1007/s00778-021-00669-2

Received: 09 September 2020
Revised: 02 February 2021
Accepted: 10 April 2021
Published: 07 May 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s00778-021-00669-2
