
SNOD: a fast sampling method of exploring node orbit degrees for large graphs

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Exploring small connected and induced subgraph patterns (CIS patterns, or graphlets) has recently attracted considerable attention. Despite recent efforts on computing how frequently a graphlet appears in a large graph (i.e., the total number of CISes isomorphic to the graphlet), little effort has been made to characterize a node’s graphlet orbit degree, i.e., the number of CISes isomorphic to the graphlet that touch the node at a particular orbit. This is an important fine-grained metric for analyzing complex networks, such as learning the functions/roles of nodes in social and biological networks. Like global graphlet counting, computing node orbit degrees for a large graph is computationally intensive, and previous methods for computing global graphlet counts are not well suited to this problem. In this paper, we propose a novel sampling method, SNOD, to efficiently estimate node orbit degrees for large-scale graphs, and we quantify the error of our estimates. To the best of our knowledge, we are the first to study this problem and to give a fast, scalable solution. We conduct experiments on a variety of real-world datasets and demonstrate that SNOD is several orders of magnitude faster than state-of-the-art enumeration methods while accurately estimating node orbit degrees for graphs with millions of edges.


Notes

  1. The orbit ID values in Fig. 2 have no specific meaning; we use the same orbit IDs as in [27].

  2. www.snap.stanford.edu.

  3. The concentration of a particular k-node graphlet in a network refers to the ratio of the graphlet count to the total number of k-node CISes in the network, \(k=3, 4, 5, \ldots \).

  4. A streaming graph is given in the form of a stream of edges.

References

  1. Ahmed NK, Neville J, Rossi RA, Duffield N (2015) Efficient graphlet counting for large networks. In: ICDM

  2. Ahmed N, Duffield N, Neville J, Kompella R (2014) Graph sample and hold: a framework for big-graph analytics. In: KDD, pp 589–597

  3. Alon N, Yuster R, Zwick U (1995) Color-coding. J ACM 42(4):844–856. https://doi.org/10.1145/210332.210337

  4. Aparicio DO, Ribeiro PMP, da Silva FMA (2014) Parallel subgraph counting for multicore architectures. In: ISPA, pp 34–41

  5. Benson AR, Gleich DF, Leskovec J (2016) Higher-order organization of complex networks. Science 353(6295):163–166

  6. Bhuiyan MA, Rahman M, Rahman M, Hasan MA (2012) Guise: Uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100

  7. Chu S, Cheng J (2011) Triangle listing in massive networks and its applications. In: KDD, pp 672–680

  8. Dave VS, Ahmed NK, Hasan MH (2017) E-CLoG: counting edge-centric local graphlets. In: BigData, pp 586–595

  9. Elenberg ER, Shanmugam K, Borokhovich M, Dimakis AG (2015) Beyond triangles: a distributed framework for estimating 3-profiles of large graphs. In: KDD, pp 229–238

  10. Elenberg ER, Shanmugam K, Borokhovich M, Dimakis AG (2016) Distributed estimation of graph 4-profiles. In: WWW

  11. Fang M, Yin J, Zhu X, Zhang C (2015) Trgraph: cross-network transfer learning via common signature subgraphs. TKDE 27(9):2536–2549

  12. Google programming contest. http://www.google.com/programming-contest/ (2002)

  13. Graybill FA, Deal RB (1959) Combining unbiased estimators. Biometrics 15(4):543–550

  14. Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: KDD

  15. Henderson K, Gallagher B, Eliassi-Rad T, Tong H, Basu S, Akoglu L, Koutra D, Faloutsos C, Li L (2012) Rolx: structural role extraction and mining in large graphs. In: KDD, pp 1231–1239

  16. Jha M, Seshadhri C, Pinar A (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: KDD, pp 589–597

  17. Jha M, Seshadhri C, Pinar A (2015) Path sampling: a fast and provable method for estimating 4-vertex subgraph counts. In: WWW, pp 495–505

  18. Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758

  19. Leskovec J, Huttenlocher D, Kleinberg J (2010) Predicting positive and negative links in online social networks. In: WWW, pp 641–650

  20. Marcus D, Shavitt Y (2012) Rage: a rapid graphlet enumerator for large networks. Comput Netw 56(2):810–819

  21. Milenkovic T, Przulj N (2008) Uncovering biological network function via graphlet degree signatures. Cancer Inform 6:257–273

  22. Milenkovic T, Memisevic V, Ganesan AK, Przulj N (2010) Systems-level cancer gene identification from protein interaction network topology applied to melanogenesis-related functional genomics data. J R Soc Interface 7(44):423–437

  23. Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42

  24. Omidi S, Schreiber F, Masoudi-nejad A (2009) Moda: an efficient algorithm for network motif discovery in biological networks. Genes Genet Syst 84(5):385–395

  25. Pavan A, Tangwongsan K, Tirthapura S, Wu KL (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881

  26. Pinar A, Seshadhri C, Vishal V (2017) Escape: efficiently counting all 5-vertex subgraphs. In: WWW

  27. Przulj N (2007) Biological network comparison using graphlet degree distribution. Bioinformatics 23(2):177–183

  28. Rahman M, Bhuiyan M, Hasan MA (2012) Graft: an approximate graphlet counting algorithm for large graph analysis. In: CIKM

  29. Rossi RA, Zhou R, Ahmed NK (2017) Estimation of graphlet statistics. CoRR abs/1701.01772

  30. Schank T (2007) Algorithmic aspects of triangle-based network analysis. PhD thesis in Computer Science

  31. Seshadhri C, Pinar A, Kolda TG (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307

  32. Shao Y, Cui B, Chen L, Ma L, Yao J, Xu N (2014) Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp 625–636

  33. Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: WWW, pp 607–614

  34. Takac L, Zabovsky M (2012) Data analysis in public social networks. In: International scientific conference and international workshop present day trends of innovations, pp 1–6

  35. Tsourakakis CE, Kang U, Miller GL, Faloutsos C (2009) Doulion: Counting triangles in massive graphs with a coin. In: KDD

  36. Wang P, Lui JC, Zhao J, Ribeiro B, Towsley D, Guan X (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27

  37. Wang P, Lui JCS, Towsley D (2015) Minfer: inferring motif statistics from sampled edges. In: ICDE

  38. Wang P, Tao J, Zhao J, Guan X (2015) Moss: a scalable tool for efficiently sampling and counting 4- and 5-node graphlets. CoRR abs/1509.08089

  39. Wei B, Liu J, Ma J, Zheng Q, Zhang W, Feng B (2014) Motif-based hyponym relation extraction from wikipedia hyperlinks. TKDE 26(10):2507–2519

  40. Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 3(4):347–359

  41. Ye J, Cheng H, Zhu Z, Chen M (2013) Predicting positive and negative links in signed social networks by transfer learning. In: WWW, pp 1477–1488

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1301254, 61603290, 61602371), the Ministry of Education & China Mobile Research Fund (MCM20160311), the 111 International Collaboration Program of China, the China Postdoctoral Science Foundation (2015M582663), Shenzhen Basic Research Grants (JCYJ20160229195940462, JCYJ20170816100819428), and the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034).

Author information

Corresponding authors

Correspondence to Junzhou Zhao or Jing Tao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Implementation details

We discuss our methods for implementing the functions in Algorithms 1, 2, and 3. We also analyze their computational complexities.

Initialization of \(\phi _v\), \(\varphi _v\), \(\check{\varphi }_v\), \(\tilde{\varphi }_v\), \(\psi _v\), and \(\gamma _v\): For each node v, we store its degree \(d_v\) and its neighbors’ degrees in a list. Therefore, O(1) and \(O(d_v)\) operations are required to compute \(\phi _v\) and \(\varphi _v\), respectively. Similarly, one can easily find that \(O(N_v)\), \(O(N_v)\), \(O(\sum _{u\in N_v} d_u)\), and O(1) operations are required to compute \(\check{\varphi }_v\), \(\tilde{\varphi }_v\), \(\psi _v\), and \(\gamma _v\), respectively.
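As a concrete illustration of this data layout (a minimal sketch of our own, not code from the paper; all identifiers are illustrative), the following Python snippet builds, for every node, its degree and the list of its neighbors’ degrees from an edge list:

```python
from collections import defaultdict

def preprocess(edges):
    """Build adjacency lists, node degrees, and per-node neighbor-degree lists.

    `edges` is an iterable of undirected edges (u, v). All names here are
    illustrative and are not taken from the SNOD implementation.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    # For each node v, cache the degrees of its neighbors in the same order
    # as adj[v]; sums over these cached lists give the O(d_v)-time quantities above.
    nbr_deg = {v: [deg[u] for u in nbrs] for v, nbrs in adj.items()}
    return adj, deg, nbr_deg
```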

\(\mathbf{RandomVertex}(N_v - \{u\})\): We use an array \(N_v[1, \ldots , d_v]\) to store the neighbors of v. Let \(i_{v,u}\) denote the index of u in the list \(N_v[1, \ldots , d_v]\), i.e., \(N_v[i_{v,u}]=u\). Then, function \(\text {RandomVertex}(N_v - \{u\})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , d_v\}-\{i_{v,u}\}\) uniformly at random;

  • Step 2. Return \(N_v[z]\).

Therefore, the computational complexity of \(\text {RandomVertex}(N_v- \{u\})\) is O(1).

\(\mathbf{RandomVertex}(N_v - \{u, w\})\): Similarly, it includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , d_v\}- \{i_{v,u}, i_{v,w}\}\) uniformly at random;

  • Step 2. Return \(N_v[z]\).

Therefore, the computational complexity of \(\text {RandomVertex}(N_v- \{u, w\})\) is O(1).
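Both RandomVertex variants can be realized with the same index-remapping trick: draw a uniform index over the non-excluded positions and shift it past the excluded slots. The sketch below is our own illustrative code (using 0-based indices rather than the 1-based arrays in the text) and shows one way to achieve O(1) time:

```python
import random

def random_vertex(N_v, excluded):
    """Return a uniformly random entry of N_v whose position is not excluded.

    N_v is the neighbor array of v; `excluded` holds one or two 0-based
    positions (e.g., the indices of u and, optionally, w) to skip.
    Runs in O(1) time for a constant number of exclusions.
    """
    z = random.randrange(len(N_v) - len(excluded))  # uniform over allowed slots
    for i in sorted(excluded):                      # shift past each excluded slot
        if z >= i:
            z += 1
    return N_v[z]

# Example (hypothetical variables): pick a random neighbor of v other than u and w.
# next_node = random_vertex(N_v, {i_vu, i_vw})
```

Processing the excluded indices in ascending order makes the mapping from the drawn value to the allowed positions a bijection, so the returned neighbor is uniform over \(N_v - \{u\}\) (or \(N_v - \{u, w\}\)).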

\(\mathbf{WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\): We store an array \(A_\alpha ^{(v)}\) in memory, where \(A_\alpha ^{(v)}[i]\) is defined as \(A_\alpha ^{(v)}[i] = \sum _{j=1}^i (d_{N_v[j]} - 1)\), \(1\le i\le d_v\). Let \(A_\alpha ^{(v)}[0]=0\). Then, \(\text {WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , A_\alpha ^{(v)}[d_v]\}\) uniformly at random;

  • Step 2. Find i such that

    $$\begin{aligned} A_\alpha ^{(v)}[i-1] < z \le A_\alpha ^{(v)}[i], \end{aligned}$$

    which is solved by binary search;

  • Step 3. Return \(N_v[i]\).

Therefore, the computational complexity of function \(\text {WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\) is \(O(\log d_v)\).

\(\mathbf{WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\): We store an array \(A_\beta ^{(v)}\) in memory, where \(A_\beta ^{(v)}[i]\) is defined as

$$\begin{aligned} A_\beta ^{(v)}[i] = \sum _{j=1}^i (\phi _{N_v[j]} - d_{N_v[j]} + 1), \quad 1\le i\le d_v. \end{aligned}$$

Let \(A_\beta ^{(v)}[0]=0\). Then, \(\text {WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , A_\beta ^{(v)}[d_v]\}\) uniformly at random;

  • Step 2. Find i such that

    $$\begin{aligned} A_\beta ^{(v)}[i-1] < z \le A_\beta ^{(v)}[i], \end{aligned}$$

    which again is solved by binary search;

  • Step 3. Return \(N_v[i]\).

Therefore, the computational complexity of function \(\text {WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\) is \(O(\log d_v)\).
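Both weighted variants reduce to the same primitive: precompute a prefix-sum array of nonnegative integer weights over the neighbors (the \(\alpha\) weights are \(d_{N_v[j]} - 1\) and the \(\beta\) weights are \(\phi _{N_v[j]} - d_{N_v[j]} + 1\)), then draw a uniform integer and locate it by binary search. A minimal sketch, with the weight vector supplied by the caller and all names ours rather than the paper’s:

```python
import bisect
import random

def build_prefix_sums(weights):
    """Precompute A[i] = weights[0] + ... + weights[i-1]; done once per node."""
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    return prefix

def weight_random_vertex(N_v, prefix):
    """Return N_v[i] with probability weights[i] / sum(weights), in O(log d_v) time."""
    z = random.randrange(1, prefix[-1] + 1)  # Step 1: uniform z in {1, ..., A[d_v]}
    i = bisect.bisect_left(prefix, z)        # Step 2: smallest i with A[i] >= z
    return N_v[i - 1]                        # Step 3: return the selected neighbor

# Example with the alpha weights, assuming `deg` maps nodes to degrees:
# prefix_alpha = build_prefix_sums([deg[u] - 1 for u in N_v])
# chosen = weight_random_vertex(N_v, prefix_alpha)
```

Because z falls in the interval \((A[i-1], A[i]]\) with probability proportional to the i-th weight, the returned neighbor follows the intended weighted distribution, and the binary search gives the stated \(O(\log d_v)\) cost.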

1.2 Proof of Theorem 7

According to Theorems 1 and 4, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(1)}_v) = \frac{d^{(1)}_v}{k}\left( \frac{1}{\pi _{1, v}} - d^{(1)}_v\right) . \end{aligned}$$

According to Theorems 1 and 5, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(i)}_v) = \frac{d^{(i)}_v}{\check{k}}\left( \frac{1}{\check{\pi }_{i, v}} - d^{(i)}_v\right) ,\quad i\in \{5, 8, 11\}. \end{aligned}$$

According to Theorems 1 and 6, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(i)}_v) = \frac{d^{(i)}_v}{\tilde{k}}\left( \frac{1}{\tilde{\pi }_{i, v}} - d^{(i)}_v\right) ,\quad i\in \{6, 9\}. \end{aligned}$$

By Theorem 2 and the definition of \(\hat{d}^{(3)}_v\), \(\hat{d}^{(10)}_v\), \(\hat{d}^{(12)}_v\), \(\hat{d}^{(13)}_v\), and \(\hat{d}^{(14)}_v\) in Eqs. (7) and (10), we have

$$\begin{aligned} \begin{aligned} \text {Var}(\hat{d}^{(i)}_v)&= \text {Var}\left( \frac{\text {Var}(\tilde{d}^{(i)}_v) \check{d}^{(i)}_v + \text {Var}(\check{d}^{(i)}_v) \tilde{d}^{(i)}_v}{\text {Var}(\check{d}^{(i)}_v) + \text {Var}(\tilde{d}^{(i)}_v)}\right) \\&= \frac{\text {Var}(\tilde{d}^{(i)}_v) \text {Var}(\check{d}^{(i)}_v)}{\text {Var}(\check{d}^{(i)}_v) + \text {Var}(\tilde{d}^{(i)}_v)}, \qquad i\in \{3, 10, 12, 13, 14\}. \end{aligned} \end{aligned}$$

In the above derivation, the last equation holds because \(\check{d}^{(i)}_v\) and \(\tilde{d}^{(i)}_v\) are independent, which can be easily obtained from their definition in Eqs. (5), (6), (8), and (9). For \(\text {Var}(\hat{d}^{(2)}_v)\), we have

$$\begin{aligned} \text {Var}(\hat{d}^{(2)}_v) = \text {Var}(\phi _v - \hat{d}^{(3)}_v) = \text {Var}(\hat{d}^{(3)}_v). \end{aligned}$$

By the definitions of \(\hat{d}^{(4)}_v\) and \(\hat{d}^{(7)}_v\), it is straightforward to show that their variances are

$$\begin{aligned} \begin{aligned} \text {Var}(\hat{d}^{(4)}_v) =&\sum _{j\in \{3, 8, 9, 10, 12, 13, 14\}} \chi _j^2 \text {Var}(\hat{d}^{(j)}_v)\\&+\sum _{j, l\in \{3, 8, 9, 10, 12, 13, 14\}\wedge j\ne l} \chi _j \chi _l \text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v),\\ \text {Var}(\hat{d}^{(7)}_v) =&~ \text {Var}(\hat{d}^{(11)}_v) + \text {Var}(\hat{d}^{(13)}_v) + \text {Var}(\hat{d}^{(14)}_v)\\&+ \sum _{j, l\in \{11,13,14\}\wedge j\ne l} \text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v). \end{aligned} \end{aligned}$$

The covariances in the above formulas for \(\text {Var}(\hat{d}^{(4)}_v)\) and \(\text {Var}(\hat{d}^{(7)}_v)\) are computed based on the following properties: (1) the covariance of estimates given by different sampling methods (e.g., Path2, Path3, and Star3) is zero; (2) the covariance of estimates given by the same sampling method is computed based on Theorem 1. For example, when \(j, l\in \{10, 12, 13, 14\}\) and \(j\ne l\), by the definition of \(\hat{d}^{(j)}_v\) in Eq. (10), we have \(\text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v)= \text {Cov}(\lambda ^{(j,1)}_v \check{d}^{(j)}_v + \lambda ^{(j,2)}_v \tilde{d}^{(j)}_v, \lambda ^{(l,1)}_v \check{d}^{(l)}_v + \lambda ^{(l,2)}_v \tilde{d}^{(l)}_v)\). By the definitions of \(\check{d}^{(j)}_v\) and \(\tilde{d}^{(j)}_v\) in Eqs. (8) and (9), the estimates \(\check{d}^{(j)}_v\) and \(\check{d}^{(l)}_v\) given by the sampling method Path3 are independent of the estimates \(\tilde{d}^{(j)}_v\) and \(\tilde{d}^{(l)}_v\) given by the sampling method Star3. Moreover, Theorem 1 gives \(\text {Cov}(\check{d}^{(j)}_v,\check{d}^{(l)}_v) = -\frac{d^{(j)}_v d^{(l)}_v}{\check{k}}\) and \(\text {Cov}(\tilde{d}^{(j)}_v,\tilde{d}^{(l)}_v) = -\frac{d^{(j)}_v d^{(l)}_v}{\tilde{k}}\). Thus, we have \(\text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v) = -\frac{\lambda ^{(j,1)}_v \lambda ^{(l,1)}_v d^{(j)}_v d^{(l)}_v}{\check{k}}-\frac{\lambda ^{(j,2)}_v \lambda ^{(l,2)}_v d^{(j)}_v d^{(l)}_v}{\tilde{k}}\).
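As a quick sanity check on the inverse-variance combination used above (an illustration of the general Graybill–Deal combination [13], not part of the paper’s proof), the following sketch empirically confirms that combining two independent unbiased estimates, each weighted by the other’s variance, yields variance \(\frac{\text {Var}_1 \text {Var}_2}{\text {Var}_1 + \text {Var}_2}\):

```python
import random

def combine(x1, var1, x2, var2):
    """Inverse-variance (Graybill-Deal) combination of two unbiased estimates."""
    return (var2 * x1 + var1 * x2) / (var1 + var2)

def empirical_check(true_value=10.0, var1=4.0, var2=1.0, trials=200_000):
    # Gaussian noise is only a stand-in; the variance identity holds for
    # any pair of independent unbiased estimators.
    random.seed(0)
    samples = [
        combine(random.gauss(true_value, var1 ** 0.5), var1,
                random.gauss(true_value, var2 ** 0.5), var2)
        for _ in range(trials)
    ]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    # The empirical variance should be close to var1 * var2 / (var1 + var2) = 0.8.
    print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}, "
          f"predicted {var1 * var2 / (var1 + var2):.3f}")

empirical_check()
```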


Cite this article

Wang, P., Zhao, J., Zhang, X. et al. SNOD: a fast sampling method of exploring node orbit degrees for large graphs. Knowl Inf Syst 61, 301–326 (2019). https://doi.org/10.1007/s10115-018-1301-z
