
SNOD: a fast sampling method of exploring node orbit degrees for large graphs

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Exploring small connected and induced subgraph patterns (CIS patterns, or graphlets) has recently attracted considerable attention. Despite recent efforts on computing how frequently a graphlet appears in a large graph (i.e., the total number of CISes isomorphic to the graphlet), little effort has been made to characterize a node’s graphlet orbit degree, i.e., the number of CISes isomorphic to the graphlet that touch the node at a particular orbit. This is an important fine-grained metric for analyzing complex networks, such as learning the functions/roles of nodes in social and biological networks. Like global graphlet counting, computing node orbit degrees for a large graph is computationally intensive, and previous methods for computing global graphlet counts are not well suited to this problem. In this paper, we propose a novel sampling method, SNOD, to efficiently estimate node orbit degrees for large-scale graphs, and we quantify the error of our estimates. To the best of our knowledge, we are the first to study this problem and to give a fast, scalable solution. We conduct experiments on a variety of real-world datasets and demonstrate that SNOD is several orders of magnitude faster than state-of-the-art enumeration methods while accurately estimating node orbit degrees for graphs with millions of edges.


Notes

  1. The orbit ID values in Fig. 2 have no specific meaning; we use the same orbit IDs as in [27].

  2. www.snap.stanford.edu.

  3. The concentration of a particular k-node graphlet in a network refers to the ratio of the graphlet count to the total number of k-node CISes in the network, \(k=3, 4, 5, \ldots \).

  4. A streaming graph is given in the form of a stream of edges.

References

  1. Ahmed NK, Neville J, Rossi RA, Duffield N (2015) Efficient graphlet counting for large networks. In: ICDM

  2. Ahmed N, Duffield N, Neville J, Kompella R (2014) Graph sample and hold: a framework for big-graph analytics. In: KDD, pp 589–597

  3. Alon N, Yuster R, Zwick U (1995) Color-coding. J ACM 42(4):844–856. https://doi.org/10.1145/210332.210337

  4. Aparicio DO, Ribeiro PMP, da Silva FMA (2014) Parallel subgraph counting for multicore architectures. In: ISPA, pp 34–41

  5. Benson AR, Gleich DF, Leskovec J (2016) Higher-order organization of complex networks. Science 353(6295):163–166

  6. Bhuiyan MA, Rahman M, Rahman M, Hasan MA (2012) Guise: Uniform sampling of graphlets for large graph analysis. In: ICDM, pp 91–100

  7. Chu S, Cheng J (2011) Triangle listing in massive networks and its applications. In: KDD, pp 672–680

  8. Dave VS, Ahmed NK, Hasan MH (2017) E-CLoG: counting edge-centric local graphlets. In: BigData, pp 586–595

  9. Elenberg ER, Shanmugam K, Borokhovich M, Dimakis AG (2015) Beyond triangles: a distributed framework for estimating 3-profiles of large graphs. In: KDD, pp 229–238

  10. Elenberg ER, Shanmugam K, Borokhovich M, Dimakis AG (2016) Distributed estimation of graph 4-profiles. In: WWW

  11. Fang M, Yin J, Zhu X, Zhang C (2015) Trgraph: cross-network transfer learning via common signature subgraphs. TKDE 27(9):2536–2549

  12. Google programming contest. http://www.google.com/programming-contest/ (2002)

  13. Graybill FA, Deal RB (1959) Combining unbiased estimators. Biometrics 15(4):543–550

  14. Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: KDD

  15. Henderson K, Gallagher B, Eliassi-Rad T, Tong H, Basu S, Akoglu L, Koutra D, Faloutsos C, Li L (2012) Rolx: structural role extraction and mining in large graphs. In: KDD, pp 1231–1239

  16. Jha M, Seshadhri C, Pinar A (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: KDD, pp 589–597

  17. Jha M, Seshadhri C, Pinar A (2015) Path sampling: a fast and provable method for estimating 4-vertex subgraph counts. In: WWW, pp 495–505

  18. Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758

  19. Leskovec J, Huttenlocher D, Kleinberg J (2010) Predicting positive and negative links in online social networks. In: WWW, pp 641–650

  20. Marcus D, Shavitt Y (2012) Rage: a rapid graphlet enumerator for large networks. Comput Netw 56(2):810–819

  21. Milenkovic T, Przulj N (2008) Uncovering biological network function via graphlet degree signatures. Cancer Inform 6:257–273

  22. Milenkovic T, Memisevic V, Ganesan AK, Przulj N (2010) Systems-level cancer gene identification from protein interaction network topology applied to melanogenesis-related functional genomics data. J R Soc Interface 7(44):423–437

  23. Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42

  24. Omidi S, Schreiber F, Masoudi-nejad A (2009) Moda: an efficient algorithm for network motif discovery in biological networks. Genes Genet Syst 84(5):385–395

  25. Pavan A, Tangwongsan K, Tirthapura S, Wu KL (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881

  26. Pinar A, Seshadhri C, Vishal V (2017) Escape: efficiently counting all 5-vertex subgraphs. In: WWW

  27. Przulj N (2007) Biological network comparison using graphlet degree distribution. Bioinformatics 23(2):177–183

  28. Rahman M, Bhuiyan M, Hasan MA (2012) Graft: an approximate graphlet counting algorithm for large graph analysis. In: CIKM

  29. Rossi RA, Zhou R, Ahmed NK (2017) Estimation of graphlet statistics. CoRR abs/1701.01772

  30. Schank T (2007) Algorithmic aspects of triangle-based network analysis. PhD thesis in Computer Science

  31. Seshadhri C, Pinar A, Kolda TG (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307

  32. Shao Y, Cui B, Chen L, Ma L, Yao J, Xu N (2014) Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp 625–636

  33. Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: WWW, pp 607–614

  34. Takac L, Zabovsky M (2012) Data analysis in public social networks. In: International scientific conference and international workshop present day trends of innovations, pp 1–6

  35. Tsourakakis CE, Kang U, Miller GL, Faloutsos C (2009) Doulion: Counting triangles in massive graphs with a coin. In: KDD

  36. Wang P, Lui JC, Zhao J, Ribeiro B, Towsley D, Guan X (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27

  37. Wang P, Lui JCS, Towsley D (2015) Minfer: inferring motif statistics from sampled edges. In: ICDE

  38. Wang P, Tao J, Zhao J, Guan X (2015) Moss: a scalable tool for efficiently sampling and counting 4- and 5-node graphlets. CoRR abs/1509.08089

  39. Wei B, Liu J, Ma J, Zheng Q, Zhang W, Feng B (2014) Motif-based hyponym relation extraction from wikipedia hyperlinks. TKDE 26(10):2507–2519

  40. Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 3(4):347–359

  41. Ye J, Cheng H, Zhu Z, Chen M (2013) Predicting positive and negative links in signed social networks by transfer learning. In: WWW, pp 1477–1488

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful feedback. The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1301254, 61603290, 61602371), the Ministry of Education & China Mobile Research Fund (MCM20160311), the 111 International Collaboration Program of China, the China Postdoctoral Science Foundation (2015M582663), Shenzhen Basic Research Grants (JCYJ20160229195940462, JCYJ20170816100819428), and the Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034).

Author information

Corresponding authors

Correspondence to Junzhou Zhao or Jing Tao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Implementation details

We discuss our methods for implementing the functions in Algorithms 1, 2, and 3. We also analyze their computational complexities.

Initialization of \(\phi _v\), \(\varphi _v\), \(\check{\varphi }_v\), \(\tilde{\varphi }_v\), \(\psi _v\), and \(\gamma _v\): For each node v, we store its degree \(d_v\) and its neighbors’ degrees in a list. Therefore, O(1) and \(O(d_v)\) operations are required to compute \(\phi _v\) and \(\varphi _v\), respectively. Similarly, one can easily find that \(O(N_v)\), \(O(N_v)\), \(O(\sum _{u\in N_v} d_u)\), and O(1) operations are required to compute \(\check{\varphi }_v\), \(\tilde{\varphi }_v\), \(\psi _v\), and \(\gamma _v\), respectively.
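As a concrete illustration of this data layout (a minimal sketch of our own, not code from the paper; all identifiers are illustrative), the following Python snippet builds, for every node, its degree and the list of its neighbors’ degrees from an edge list:

```python
from collections import defaultdict

def preprocess(edges):
    """Build adjacency lists, node degrees, and per-node neighbor-degree lists.

    `edges` is an iterable of undirected edges (u, v). All names here are
    illustrative and are not taken from the SNOD implementation.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    # For each node v, cache the degrees of its neighbors in the same order
    # as adj[v]; sums over these cached lists give the O(d_v)-time quantities above.
    nbr_deg = {v: [deg[u] for u in nbrs] for v, nbrs in adj.items()}
    return adj, deg, nbr_deg
```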

\(\mathbf{RandomVertex}(N_v - \{u\})\): We use an array \(N_v[1, \ldots , d_v]\) to store the neighbors of v. Let \(i_{v,u}\) denote the index of u in the list \(N_v[1, \ldots , d_v]\), i.e., \(N_v[i_{v,u}]=u\). Then, function \(\text {RandomVertex}(N_v - \{u\})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , d_v\}-\{i_{v,u}\}\) uniformly at random;

  • Step 2. Return \(N_v[z]\).

Therefore, the computational complexity of \(\text {RandomVertex}(N_v- \{u\})\) is O(1).

\(\mathbf{RandomVertex}(N_v - \{u, w\})\): Similarly, it includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , d_v\}- \{i_{v,u}, i_{v,w}\}\) uniformly at random;

  • Step 2. Return \(N_v[z]\).

Therefore, the computational complexity of \(\text {RandomVertex}(N_v- \{u, w\})\) is O(1).
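Both RandomVertex variants can be realized with the same index-remapping trick: draw a uniform index over the non-excluded positions and shift it past the excluded slots. The sketch below is our own illustrative code (using 0-based indices rather than the 1-based arrays in the text) and shows one way to achieve O(1) time:

```python
import random

def random_vertex(N_v, excluded):
    """Return a uniformly random entry of N_v whose position is not excluded.

    N_v is the neighbor array of v; `excluded` holds one or two 0-based
    positions (e.g., the indices of u and, optionally, w) to skip.
    Runs in O(1) time for a constant number of exclusions.
    """
    z = random.randrange(len(N_v) - len(excluded))  # uniform over allowed slots
    for i in sorted(excluded):                      # shift past each excluded slot
        if z >= i:
            z += 1
    return N_v[z]

# Example (hypothetical variables): pick a random neighbor of v other than u and w.
# next_node = random_vertex(N_v, {i_vu, i_vw})
```

Processing the excluded indices in ascending order makes the mapping from the drawn value to the allowed positions a bijection, so the returned neighbor is uniform over \(N_v - \{u\}\) (or \(N_v - \{u, w\}\)).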

\(\mathbf{WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\): We store an array \(A_\alpha ^{(v)}\) in memory, where \(A_\alpha ^{(v)}[i]\) is defined as \(A_\alpha ^{(v)}[i] = \sum _{j=1}^i (d_{N_v[j]} - 1)\), \(1\le i\le d_v\). Let \(A_\alpha ^{(v)}[0]=0\). Then, \(\text {WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , A_\alpha ^{(v)}[d_v]\}\) uniformly at random;

  • Step 2. Find i such that

    $$\begin{aligned} A_\alpha ^{(v)}[i-1] < z \le A_\alpha ^{(v)}[i], \end{aligned}$$

    which is solved by binary search;

  • Step 3. Return \(N_v[i]\).

Therefore, the computational complexity of function \(\text {WeightRandomVertex}(N_v, \varvec{\alpha }^{(v)})\) is \(O(\log d_v)\).

\(\mathbf{WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\): We store an array \(A_\beta ^{(v)}\) in memory, where \(A_\beta ^{(v)}[i]\) is defined as

$$\begin{aligned} A_\beta ^{(v)}[i] = \sum _{j=1}^i (\phi _{N_v[j]} - d_{N_v[j]} + 1), \quad 1\le i\le d_v. \end{aligned}$$

Let \(A_\beta ^{(v)}[0]=0\). Then, \(\text {WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\) includes the following steps:

  • Step 1. Select a number z from \(\{1, \ldots , A_\beta ^{(v)}[d_v]\}\) uniformly at random;

  • Step 2. Find i such that

    $$\begin{aligned} A_\beta ^{(v)}[i-1] < z \le A_\beta ^{(v)}[i], \end{aligned}$$

    which again is solved by binary search;

  • Step 3. Return \(N_v[i]\).

Therefore, the computational complexity of function \(\text {WeightRandomVertex}(N_v, \varvec{\beta }^{(v)})\) is \(O(\log d_v)\).
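Both weighted variants reduce to the same primitive: precompute a prefix-sum array of nonnegative integer weights over the neighbors (the \(\alpha\) weights are \(d_{N_v[j]} - 1\) and the \(\beta\) weights are \(\phi _{N_v[j]} - d_{N_v[j]} + 1\)), then draw a uniform integer and locate it by binary search. A minimal sketch, with the weight vector supplied by the caller and all names ours rather than the paper’s:

```python
import bisect
import random

def build_prefix_sums(weights):
    """Precompute A[i] = weights[0] + ... + weights[i-1]; done once per node."""
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    return prefix

def weight_random_vertex(N_v, prefix):
    """Return N_v[i] with probability weights[i] / sum(weights), in O(log d_v) time."""
    z = random.randrange(1, prefix[-1] + 1)  # Step 1: uniform z in {1, ..., A[d_v]}
    i = bisect.bisect_left(prefix, z)        # Step 2: smallest i with A[i] >= z
    return N_v[i - 1]                        # Step 3: return the selected neighbor

# Example with the alpha weights, assuming `deg` maps nodes to degrees:
# prefix_alpha = build_prefix_sums([deg[u] - 1 for u in N_v])
# chosen = weight_random_vertex(N_v, prefix_alpha)
```

Because z falls in the interval \((A[i-1], A[i]]\) with probability proportional to the i-th weight, the returned neighbor follows the intended weighted distribution, and the binary search gives the stated \(O(\log d_v)\) cost.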

1.2 Proof of Theorem 7

According to Theorems 1 and 4, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(1)}_v) = \frac{d^{(1)}_v}{k}\left( \frac{1}{\pi _{1, v}} - d^{(1)}_v\right) . \end{aligned}$$

According to Theorems 1 and 5, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(i)}_v) = \frac{d^{(i)}_v}{\check{k}}\left( \frac{1}{\check{\pi }_{i, v}} - d^{(i)}_v\right) ,\quad i\in \{5, 8, 11\}. \end{aligned}$$

According to Theorems 1 and 6, we have

$$\begin{aligned} \text {Var}(\hat{d}^{(i)}_v) = \frac{d^{(i)}_v}{\tilde{k}}\left( \frac{1}{\tilde{\pi }_{i, v}} - d^{(i)}_v\right) ,\quad i\in \{6, 9\}. \end{aligned}$$

By Theorem 2 and the definition of \(\hat{d}^{(3)}_v\), \(\hat{d}^{(10)}_v\), \(\hat{d}^{(12)}_v\), \(\hat{d}^{(13)}_v\), and \(\hat{d}^{(14)}_v\) in Eqs. (7) and (10), we have

$$\begin{aligned} \begin{aligned} \text {Var}(\hat{d}^{(i)}_v)&= \text {Var}\left( \frac{\text {Var}(\tilde{d}^{(i)}_v) \check{d}^{(i)}_v + \text {Var}(\check{d}^{(i)}_v) \tilde{d}^{(i)}_v}{\text {Var}(\check{d}^{(i)}_v) + \text {Var}(\tilde{d}^{(i)}_v)}\right) \\&= \frac{\text {Var}(\tilde{d}^{(i)}_v) \text {Var}(\check{d}^{(i)}_v)}{\text {Var}(\check{d}^{(i)}_v) + \text {Var}(\tilde{d}^{(i)}_v)}, \qquad i\in \{3, 10, 12, 13, 14\}. \end{aligned} \end{aligned}$$

In the above derivation, the last equation holds because \(\check{d}^{(i)}_v\) and \(\tilde{d}^{(i)}_v\) are independent, which can be easily obtained from their definition in Eqs. (5), (6), (8), and (9). For \(\text {Var}(\hat{d}^{(2)}_v)\), we have

$$\begin{aligned} \text {Var}(\hat{d}^{(2)}_v) = \text {Var}(\phi _v - \hat{d}^{(3)}_v) = \text {Var}(\hat{d}^{(3)}_v). \end{aligned}$$

By the definitions of \(\hat{d}^{(4)}_v\) and \(\hat{d}^{(7)}_v\), it is straightforward to show that their variances are

$$\begin{aligned} \begin{aligned} \text {Var}(\hat{d}^{(4)}_v) =&\sum _{j\in \{3, 8, 9, 10, 12, 13, 14\}} \chi _j^2 \text {Var}(\hat{d}^{(j)}_v)\\&+\sum _{j, l\in \{3, 8, 9, 10, 12, 13, 14\}\wedge j\ne l} \chi _j \chi _l \text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v),\\ \text {Var}(\hat{d}^{(7)}_v) =&~ \text {Var}(\hat{d}^{(11)}_v) + \text {Var}(\hat{d}^{(13)}_v) + \text {Var}(\hat{d}^{(14)}_v)\\&+ \sum _{j, l\in \{11,13,14\}\wedge j\ne l} \text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v). \end{aligned} \end{aligned}$$

The covariances in the above formulas for \(\text {Var}(\hat{d}^{(4)}_v)\) and \(\text {Var}(\hat{d}^{(7)}_v)\) are computed based on the following properties: (1) the covariance of estimates given by different sampling methods (e.g., Path2, Path3, and Star3) is zero; (2) the covariance of estimates given by the same sampling method is computed based on Theorem 1. For example, when \(j, l\in \{10, 12, 13, 14\}\) and \(j\ne l\), by the definition of \(\hat{d}^{(j)}_v\) in Eq. (10), we have \(\text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v)= \text {Cov}(\lambda ^{(j,1)}_v \check{d}^{(j)}_v + \lambda ^{(j,2)}_v \tilde{d}^{(j)}_v, \lambda ^{(l,1)}_v \check{d}^{(l)}_v + \lambda ^{(l,2)}_v \tilde{d}^{(l)}_v)\). By the definitions of \(\check{d}^{(j)}_v\) and \(\tilde{d}^{(j)}_v\) in Eqs. (8) and (9), the estimates \(\check{d}^{(j)}_v\) and \(\check{d}^{(l)}_v\) given by the sampling method Path3 are independent of the estimates \(\tilde{d}^{(j)}_v\) and \(\tilde{d}^{(l)}_v\) given by the sampling method Star3. Moreover, Theorem 1 gives \(\text {Cov}(\check{d}^{(j)}_v,\check{d}^{(l)}_v) = -\frac{d^{(j)}_v d^{(l)}_v}{\check{k}}\) and \(\text {Cov}(\tilde{d}^{(j)}_v,\tilde{d}^{(l)}_v) = -\frac{d^{(j)}_v d^{(l)}_v}{\tilde{k}}\). Thus, we have \(\text {Cov}(\hat{d}^{(j)}_v, \hat{d}^{(l)}_v) = -\frac{\lambda ^{(j,1)}_v \lambda ^{(l,1)}_v d^{(j)}_v d^{(l)}_v}{\check{k}}-\frac{\lambda ^{(j,2)}_v \lambda ^{(l,2)}_v d^{(j)}_v d^{(l)}_v}{\tilde{k}}\).
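As a quick sanity check on the inverse-variance combination used above (an illustration of the general Graybill–Deal combination [13], not part of the paper’s proof), the following sketch empirically confirms that combining two independent unbiased estimates, each weighted by the other’s variance, yields variance \(\frac{\text {Var}_1 \text {Var}_2}{\text {Var}_1 + \text {Var}_2}\):

```python
import random

def combine(x1, var1, x2, var2):
    """Inverse-variance (Graybill-Deal) combination of two unbiased estimates."""
    return (var2 * x1 + var1 * x2) / (var1 + var2)

def empirical_check(true_value=10.0, var1=4.0, var2=1.0, trials=200_000):
    # Gaussian noise is only a stand-in; the variance identity holds for
    # any pair of independent unbiased estimators.
    random.seed(0)
    samples = [
        combine(random.gauss(true_value, var1 ** 0.5), var1,
                random.gauss(true_value, var2 ** 0.5), var2)
        for _ in range(trials)
    ]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    # The empirical variance should be close to var1 * var2 / (var1 + var2) = 0.8.
    print(f"mean ~ {mean:.3f}, variance ~ {var:.3f}, "
          f"predicted {var1 * var2 / (var1 + var2):.3f}")

empirical_check()
```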


Cite this article

Wang, P., Zhao, J., Zhang, X. et al. SNOD: a fast sampling method of exploring node orbit degrees for large graphs. Knowl Inf Syst 61, 301–326 (2019). https://doi.org/10.1007/s10115-018-1301-z
