Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs

Abstract

Second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measures and community detection, use it to capture higher-order patterns in the graph and thereby improve model accuracy. However, the memory explosion problem of this technique prevents it from being applied to large graphs. When processing a billion-edge graph such as Twitter, existing solutions for the second-order random walk (e.g., the alias method) may take up 1796 TB of memory. Such high memory consumption stems from memory-unaware strategies for node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling from the skewed probability distributions that exist in natural graphs. Second, to make the second-order random walk efficient within arbitrary memory budgets, we propose a novel memory-aware framework built on the cost model. The framework applies a cost-based optimizer that assigns a suitable node sampling method to each node or edge in the graph, so that the memory budget is respected while the time cost of the random walk is minimized. Finally, the framework provides general programming interfaces that let users define new second-order random walk models easily. Empirical studies demonstrate that our memory-aware framework is robust with respect to memory and achieves considerable efficiency while reducing the memory cost by 90%.

Notes

  1. http://www.columbia.edu/~ks20/4703-Sigman/4703-07-Notes-ARM.pdf

  2. Note that the coefficient c is incurred by finding the edge id between previous node u and current node v to access the group information.

  3. https://www.openmp.org/

  4. http://snap.stanford.edu/data/soc-LiveJournal1.html

  5. https://an.kaist.ac.kr/traces/WWW2010.html

  6. http://law.di.unimi.it/webdata/uk-2007-05/

  7. http://socialcomputing.asu.edu/datasets

  8. Note that the minimal memory of the rejection method differs from that in our conference version: we store the number of common neighbors of each edge in memory to quickly compute the exact bounding constant, which improves the efficiency of the rejection method over billion-edge graphs.

  9. https://www.mindspore.cn/

Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1402600), NSFC (Nos. U1936104, 61902037, 61832001), CAAI-Huawei MindSpore Open Fund, Beijing Academy of Artificial Intelligence (BAAI), PKU-Baidu Fund 2019BD006, and the Fundamental Research Funds for the Central Universities 2020RC25. Lei Chen’s work is partially supported by the National Key Research and Development Program of China Grant No. 2018AAA0101100, the Hong Kong RGC GRF Project 16202218, CRF Projects C6030-18G, C1031-18G and C5026-18G, AOE Project AoE/E-603/18, Theme-based project TRS T41-603/20R, China NSFC No. 61729201, Guangdong Basic and Applied Basic Research Foundation 2019B151530001, Hong Kong ITC ITF grants ITS/044/18FX and ITS/470/18FX, Microsoft Research Asia Collaborative Research Grant, Didi-HKUST joint research lab project, and Wechat and Webank Research Grants.

Author information

Corresponding author

Correspondence to Yawen Li.

Appendices

Proof of Proposition 1

Proof

Let \(G(V,E)\) be an unweighted graph, and let u and v be the previous node and the current node, respectively. Since the nodes in the same group have the same probability, we use \(p_i, i=1..3\), to denote the e2e probability of a node in the ith group, and \(p_i^g, i=1..3\), to denote the probability of the ith group.

According to the definition of groups, we have \(|G_{I}|=1, |G_{II}|=\theta _{uv}, |G_{III}|=d_v-1-\theta _{uv}\). Therefore, \(p_{i}^g=\sum _{j=1}^{|G_i|}p_i=|G_i|\,p_i\).

Based on the analysis in Section 5.1, the time cost of the rejection node sampler, \(T_r\), is \(C_vcK\), and the time cost of the group-based node sampler, \(T_g\), is \((1+p_1^gc+p_2^gC_{v}^{g_2}c+ p_3^gC_{v}^{g_3}c)K\).

For an unweighted graph, we have \(C_v = d_v \max \{p_1, p_2, p_3\}\), \(C_v^{g_2}=1\), and \(C_v^{g_3}=\frac{d_v}{d_v-\theta _{uv}-1}\).

The condition \(T_r > T_g\) requires

$$\begin{aligned}&T_r-T_g = C_vcK-(1+p_1^gc+p_2^gC_{v}^{g_2}c+ p_3^{g}C_{v}^{g_3}c)K \\&\quad = (d_v \max _{i=1..3}\{p_i\}c - (1+p_1c+p_2\theta _{uv}c+ p_3d_vc))K \\&\quad = (d_v (\max _{i=1..3}\{p_i\}-p_3)c - (1+p_1c+p_2\theta _{uv}c))K \\&\quad > 0 \end{aligned}$$

Therefore, the above inequality holds when

$$\begin{aligned} d_v (\max _{i=1..3}\{p_i\}-p_3) - (p_1+p_2\theta _{uv}) > \frac{1}{c} \end{aligned}$$

is satisfied, which proves the proposition. \(\square \)
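As a numerical sanity check of the condition above, the following sketch evaluates both cost expressions for an unweighted graph and compares the sign of \(T_r-T_g\) with the inequality. The function names, the helper `normalized_probs`, and the sample weights, degrees and values of c are illustrative assumptions and are not taken from the paper's implementation or experiments.

```python
def normalized_probs(weights, d_v, theta_uv):
    """Turn hypothetical per-group weights (w1, w2, w3) into per-node
    probabilities (p1, p2, p3) that sum to 1 over the d_v neighbors of v."""
    w1, w2, w3 = weights
    z = w1 + theta_uv * w2 + (d_v - 1 - theta_uv) * w3
    return (w1 / z, w2 / z, w3 / z)


def rejection_cost(d_v, p, c, K=1.0):
    """T_r = C_v * c * K with C_v = d_v * max(p1, p2, p3)."""
    return d_v * max(p) * c * K


def group_cost(d_v, theta_uv, p, c, K=1.0):
    """T_g = (1 + p1^g c + p2^g C^{g2} c + p3^g C^{g3} c) K."""
    p1, p2, p3 = p
    size3 = d_v - 1 - theta_uv           # |G_III|
    if size3 <= 0:
        raise ValueError("this sketch assumes a non-empty third group")
    p1_g = p1                            # |G_I| = 1
    p2_g = theta_uv * p2                 # |G_II| = theta_uv
    p3_g = size3 * p3
    c_g2 = 1.0
    c_g3 = d_v / size3
    return (1 + p1_g * c + p2_g * c_g2 * c + p3_g * c_g3 * c) * K


def proposition_condition(d_v, theta_uv, p, c):
    """d_v (max p_i - p3) - (p1 + p2 theta_uv) > 1 / c."""
    p1, p2, p3 = p
    return d_v * (max(p) - p3) - (p1 + p2 * theta_uv) > 1.0 / c


if __name__ == "__main__":
    # Illustrative setting only: d_v = 100, theta_uv = 10, c = 1.
    p = normalized_probs((4.0, 1.0, 0.25), d_v=100, theta_uv=10)
    faster = rejection_cost(100, p, 1.0) > group_cost(100, 10, p, 1.0)
    print(faster, proposition_condition(100, 10, p, 1.0))
```

When the three probabilities are equal, the left-hand side of the condition is negative, so under these cost expressions the group-based sampler is never cheaper than the rejection sampler in that case.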

Table 10 The distribution of types of node samplers and the concrete node samplers for the nodes with top-10 largest degrees when running NV(0.25, 4) with different greedy algorithms over Youtube. The values in parentheses are the average degree for the nodes having the same node sampler. N: Naive, R: Rejection, A: Alias

LP-domination analysis

In this section, we show that there is no LP-domination among the alias, rejection and naive sampling methods. We give the proof under a common setting: \(b_f=4\), \(b_i=4\), \(c=1\).

Proof

Following the cost model in Table 2, to prove that there is no LP-domination among the three sampling methods, we need to show that \(\frac{T_{r}-T_{n}}{M_{r}-M_{n}}-\frac{T_{a}-T_{r}}{M_{a}-M_{r}}\le 0\) holds.

$$\begin{aligned}&\frac{T_{r}-T_{n}}{M_{r}-M_{n}}-\frac{T_{a}-T_{r}}{M_{a}-M_{r}}\\&\quad =\frac{C_vcK-d_v(c+1)K}{(2b_f+b_i)d_v-M_{n}}-\frac{K-C_vcK}{(b_f+b_i)d^2_v-b_fd_v}\\&\quad =\frac{C_vK-2d_vK}{12d_v-M_{n}}-\frac{K-C_vK}{8d^2_v-4d_v} \\&\quad =K\frac{(C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n)}{(12d_v-M_n)(8d_v^2-4d_v)} \end{aligned}$$

With \(0<M_n=\frac{b_fd_{max}}{|V|}<b_f=4\) and \(C_v\le d_v\), it is easy to see that \((12d_v-M_n)(8d_v^2-4d_v) > 0\) when \(d_v \ge 1\). Hence, we only need to bound \((C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n)\) as follows:

$$\begin{aligned}&(C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n) \\&\quad \text {//let~} C_v=d_v\\&\quad \le -d_v(8d_v^2-4d_v)-(1-d_v)(12d_v-M_n) \\&\quad =-8d_v^3+16d_v^2-12d_v+M_n-d_vM_n \\&\quad \text {//let~} M_n=4~\text {and~omit}~-d_vM_n\\&\quad < -8d_v^3+16d_v^2-12d_v+4 ~~~~~~~\text {//}(d_v \ge 1)\\&\quad \le 0 . \end{aligned}$$

\(\square \)
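The bound above can also be checked by brute force. The sketch below evaluates the numerator and denominator of the combined fraction on a grid of \(d_v\), \(C_v\) and \(M_n\) values under the setting \(b_f=b_i=4\), \(c=1\). The function names, the grid resolution, the maximum degree, and the extra restriction \(C_v \ge 1\) are assumptions made for this illustration only.

```python
def lp_numerator(d_v, C_v, M_n):
    """Numerator of the combined fraction for b_f = b_i = 4 and c = 1."""
    return (C_v - 2 * d_v) * (8 * d_v**2 - 4 * d_v) - (1 - C_v) * (12 * d_v - M_n)


def lp_denominator(d_v, M_n):
    """Denominator of the combined fraction for b_f = b_i = 4."""
    return (12 * d_v - M_n) * (8 * d_v**2 - 4 * d_v)


def max_numerator(max_degree=200, steps=20):
    """Largest numerator found on a grid with d_v >= 1, 0 < M_n < 4
    and 1 <= C_v <= d_v; also checks that the denominator is positive."""
    worst = float("-inf")
    for d_v in range(1, max_degree + 1):
        for i in range(1, steps):
            M_n = 4.0 * i / steps                    # strictly inside (0, 4)
            if lp_denominator(d_v, M_n) <= 0:
                raise AssertionError("denominator should be positive")
            for j in range(steps + 1):
                C_v = 1 + (d_v - 1) * j / steps      # from 1 up to d_v
                worst = max(worst, lp_numerator(d_v, C_v, M_n))
    return worst


if __name__ == "__main__":
    # A non-positive value is consistent with the inequality in the proof.
    print(max_numerator())
```

A grid search of this kind does not replace the proof; it only confirms that no counterexample appears in the sampled range.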

Analysis of the results of Deg-inc on Youtube

In Fig. 8a, b, when the memory budget is larger than 7.5 GB, Deg-inc performs similarly to LP-std and LP-est on Youtube. To analyze the reasons behind this result, we take Fig. 8a as an example and profile the distribution of node sampler types. We also give the concrete node samplers for the nodes with the top-10 largest degrees. The statistics are reported in Table 10. From the table, we clearly see that when the memory budget is 7.5 GB, the distribution of node sampler types is almost the same for LP-std and Deg-inc. After checking the complete node sampler assignment, we find that only two nodes have different node samplers. Recall that Deg-inc processes the nodes with small degrees first. Because Youtube is sparse, even when all the nodes with small degrees are assigned the alias method, enough memory budget is left to allow the nodes with large degrees to use the rejection method. However, when the memory budget is 2.5 GB, the nodes with large degrees are assigned the naive node sampler by Deg-inc, resulting in poor efficiency. Unlike Deg-inc, Deg-dec is able to assign the alias method or the rejection method to the nodes with large degrees regardless of whether the memory budget is 2.5 GB or 7.5 GB. However, Deg-dec always processes the largest nodes first and thus consumes a large share of the memory budget. As a result, Deg-dec leaves many other nodes using the naive method, and the average degree of the naive method for Deg-dec in Table 10 implicitly reflects this node sampler assignment.
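The effect of the processing order can be illustrated with a toy greedy assignment. The sketch below is a simplification and not the paper's Deg-inc or Deg-dec implementation: it assumes that each node is upgraded to the fastest sampler that still fits in the remaining budget, that the naive sampler costs no extra memory, and that the per-node memory costs are \(M_r=(2b_f+b_i)d_v\) and \(M_a=(b_f+b_i)(d_v^2+d_v)\) with \(b_f=b_i=4\), as read off the LP-domination derivation. The function names, the degree list and the budgets (in bytes) are hypothetical.

```python
B_F = 4  # bytes per float (b_f)
B_I = 4  # bytes per integer (b_i)


def memory_cost(sampler, d_v):
    """Per-node memory in bytes for N (naive), R (rejection), A (alias)."""
    if sampler == "N":
        return 0                                  # assumption: no extra state
    if sampler == "R":
        return (2 * B_F + B_I) * d_v
    if sampler == "A":
        return (B_F + B_I) * (d_v * d_v + d_v)
    raise ValueError(f"unknown sampler: {sampler}")


def greedy_assign(degrees, budget, increasing=True):
    """Visit nodes by degree (ascending if increasing, else descending) and
    give each one the fastest sampler that fits in the remaining budget."""
    order = sorted(range(len(degrees)), key=lambda v: degrees[v],
                   reverse=not increasing)
    assignment = ["N"] * len(degrees)
    remaining = budget
    for v in order:
        for sampler in ("A", "R"):                # fastest sampler first
            cost = memory_cost(sampler, degrees[v])
            if cost <= remaining:
                assignment[v] = sampler
                remaining -= cost
                break
    return assignment


if __name__ == "__main__":
    degrees = [2, 2, 2, 2, 1000]                  # one hub, four small nodes
    for budget in (13000, 12100):
        print(budget,
              greedy_assign(degrees, budget, increasing=True),
              greedy_assign(degrees, budget, increasing=False))
```

In this toy setting, the increasing-degree order leaves the hub on the naive sampler once the budget becomes tight, whereas the decreasing-degree order keeps the hub on the rejection sampler but pushes some small nodes back to the naive sampler, which mirrors the behavior described above.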

About this article

Cite this article

Shao, Y., Huang, S., Li, Y. et al. Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs. The VLDB Journal 30, 769–797 (2021). https://doi.org/10.1007/s00778-021-00669-2

Keywords

  • Random walk
  • Memory efficient
  • Graph algorithm
  • Large-scale