Skip to main content
Log in

Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

We study the problem of estimating the value of sums of the form \(S_p \triangleq \sum \left( {\begin{array}{c}x_i\\ p\end{array}}\right) \) when one has the ability to sample \(x_i \ge 0\) with probability proportional to its magnitude. When \(p=2\), this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when \(\{x_i\}\) is the degree sequence of a graph, which corresponds to counting the number of p-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a \((1 \pm \varepsilon )\)-multiplicative approximation of \(S_p\) has query and time complexities \(\mathrm{O}\left( \frac{m \log \log n}{\epsilon ^2 S_p^{1/p}}\right) \). Here, \(m=\sum x_i/2\) is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when \(\{x_i\}\) is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where \(S_p\le n\), and \(p=2\), our upper bound is \(\tilde{O}(n/S_p^{1/2})\), in contrast to their \(\varOmega (n/S_p^{1/3})\) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Notes

  1. One useful technique for giving lower bounds on sublinear time algorithms, pioneered by [12], is to make use of a connection between lower bounds in communication complexity and lower bounds on sublinear time algorithms. More specifically, by giving a reduction from a communication complexity problem to the problem we want to solve, a lower bound on the communication complexity problem yields a lower bound on our problem. In the past, this approach has led to simpler and cleaner sublinear time lower bounds for many problems. Attempts at such an approach for reducing the set-disjointness problem in communication complexity to our estimation problem on graphs run into the following difficulties: First, as explained in [22], the straightforward reduction adds a logarithmic overhead, thereby weakening the lower bound by the same factor. Second, the reduction seems to work only in the case of sparse graphs. Although it is not clear if these difficulties are insurmountable, it seems that it will not give a simpler argument than the approach that we present in this work.

  2. For our counting purpose, if \(x < y\) then we define \({x \atopwithdelims ()y} = 0\).

  3. We may use the binomial coefficients \({x \atopwithdelims ()y}\) for non-integral value x in the inequalities. These can be interpreted through alternative formulations of binomial coefficients using falling factorials or analytic functions.

  4. See e.g., [42] for more information on Yao’s principle.

  5. To be consistent with our notation where indices begin at 1, let \(x\;\mathrm {mod}\;y=y\) when x is a multiple of y. (Otherwise, \(x\;\mathrm {mod}\;y\) still denotes the remainder of \(x \div y\).)

  6. A permutation \(\pi \) over [n] is a bijection \(\pi : [n] \rightarrow [n]\).

References

  1. Ahn, K.J., Guha, S., McGregor, A.: Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp. 5–14 (2012). doi:10.1145/2213556.2213560

  2. Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. In: Proceedings 16th International Conference on Intelligent Systems for Molecular Biology (ISMB), Toronto, Canada, July 19–23, 2008, pp. 241–249 (2008). doi:10.1093/bioinformatics/btn163

  3. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64(3), 719–747 (2002). doi:10.1006/jcss.2001.1813

    Article  MathSciNet  MATH  Google Scholar 

  4. Alon, N., Gutner, S.: Balanced hashing, color coding and approximate counting. In: Parameterized and Exact Computation, 4th International Workshop, IWPEC 2009, Copenhagen, Denmark, September 10–11, 2009, Revised Selected Papers, pp. 1–16 (2009). doi:10.1007/978-3-642-11269-0_1

  5. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi:10.1006/jcss.1997.1545

    Article  MathSciNet  MATH  Google Scholar 

  6. Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209–223 (1997). doi:10.1007/BF02523189

    Article  MathSciNet  MATH  Google Scholar 

  7. Amini, O., Fomin, F.V., Saurabh, S.: Counting subgraphs via homomorphisms. SIAM J. Discrete Math. 26(2), 695–717 (2012). doi:10.1137/100789403

    Article  MathSciNet  MATH  Google Scholar 

  8. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6–8, 2002, San Francisco, CA, USA., pp. 623–632 (2002). http://dl.acm.org/citation.cfm?id=545381.545464

  9. Batu, T., Berenbrink, P., Sohler, C.: A sublinear-time approximation scheme for bin packing. Theor. Comput. Sci. 410(47–49), 5082–5092 (2009). doi:10.1016/j.tcs.2009.08.006

    Article  MathSciNet  MATH  Google Scholar 

  10. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, pp. 16–24 (2008). doi:10.1145/1401890.1401898

  11. Bhuvanagiri, L., Ganguly, S., Kesh, D., Saha, C.: Simpler algorithm for estimating frequency moments of data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 708–713. ACM (2006)

  12. Blais, E., Brody, J., Matulef, K.: Property testing lower bounds via communication complexity. Comput. Complex. 21(2), 311–358 (2012). doi:10.1007/s00037-012-0040-x

    Article  MathSciNet  MATH  Google Scholar 

  13. Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26–28, 2006, Chicago, Illinois, USA, pp. 253–262 (2006). doi:10.1145/1142351.1142388

  14. Canonne, C.L., Rubinfeld, R.: Testing probability distributions underlying aggregated data. In: Automata, Languages, and Programming—41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8–11, 2014, Proceedings, Part I, pp. 283–295 (2014). doi:10.1007/978-3-662-43948-7_24

  15. Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11–14, 2004, pp. 151–156 (2004). http://dl.acm.org/citation.cfm?id=982792.982815

  16. Duke, R.A., Lefmann, H., Rödl, V.: A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM J. Comput. 24(3), 598–620 (1995). doi:10.1137/S0097539793247634

    Article  MathSciNet  MATH  Google Scholar 

  17. Eden, T., Levi, A., Ron, D., Seshadhri, C.: Approximately counting triangles in sublinear time. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17–20 October, 2015, pp. 614–633 (2015). doi:10.1109/FOCS.2015.44

  18. Feige, U.: On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. 35(4), 964–984 (2006). doi:10.1137/S0097539704447304

    Article  MathSciNet  MATH  Google Scholar 

  19. Flum, J., Grohe, M.: The parameterized complexity of counting problems. SIAM J. Comput. 33(4), 892–922 (2004). doi:10.1137/S0097539703427203

    Article  MathSciNet  MATH  Google Scholar 

  20. Fomin, F.V., Lokshtanov, D., Raman, V., Saurabh, S., Rao, B.V.R.: Faster algorithms for finding and counting subgraphs. J. Comput. Syst. Sci. 78(3), 698–706 (2012). doi:10.1016/j.jcss.2011.10.001

    Article  MathSciNet  MATH  Google Scholar 

  21. Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21–24, 2001, pp. 461–472 (2001). doi:10.1145/375663.375727

  22. Goldreich, O.: On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electron. Colloq. Comput. Complex. (ECCC) 20, 73 (2013). http://eccc.hpi-web.de/report/2013/073

  23. Goldreich, O., Ron, D.: Approximating average parameters of graphs. Random Struct. Algorithms 32(4), 473–493 (2008). doi:10.1002/rsa.20203

    Article  MathSciNet  MATH  Google Scholar 

  24. Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math. 25(3), 1365–1411 (2011). doi:10.1137/100783066

    Article  MathSciNet  MATH  Google Scholar 

  25. Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Internet Math. 6(3), 349–372 (2009). doi:10.1080/15427951.2009.10390645

    Article  MathSciNet  MATH  Google Scholar 

  26. Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Research in Computational Molecular Biology, 11th Annual International Conference, RECOMB 2007, Oakland, CA, USA, April 21–25, 2007, Proceedings, pp. 92–106 (2007). doi:10.1007/978-3-540-71681-5_7

  27. Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223–250 (2009). doi:10.1002/sam.10016

    Article  MathSciNet  Google Scholar 

  28. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550–569 (1996). doi:10.1006/jcss.1996.0041

    Article  MathSciNet  MATH  Google Scholar 

  29. Hales, D., Arteconi, S.: Motifs in evolving cooperative networks look like protein structure networks. NHM 3(2), 239–249 (2008). doi:10.3934/nhm.2008.3.239

    Article  MathSciNet  MATH  Google Scholar 

  30. Hassidim, A., Kelner, J.A., Nguyen, H.N., Onak, K.: Local graph partitions for approximation and testing. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25–27, 2009, Atlanta, Georgia, USA, pp. 22–31 (2009). doi:10.1109/FOCS.2009.77

  31. Hormozdiari, F., Berenbrink, P., Przulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Comput Biol (2007). doi:10.1371/journal.pcbi.0030118

    Google Scholar 

  32. Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22–24, 2005, pp. 202–208 (2005). doi:10.1145/1060590.1060621

  33. Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting arbitrary subgraphs in data streams. In: Automata, Languages, and Programming—39th International Colloquium, ICALP 2012, Warwick, UK, July 9–13, 2012, Proceedings, Part II, pp. 598–609 (2012). doi:10.1007/978-3-642-31585-5_53

  34. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746–1758 (2004). doi:10.1093/bioinformatics/bth163

    Article  Google Scholar 

  35. Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 8(1–2), 161–185 (2012). doi:10.1080/15427951.2012.625260

    Article  MathSciNet  MATH  Google Scholar 

  36. Lee, J., Kim, D., Chung, C.: Multi-dimensional selectivity estimation using compressed histogram information. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA., pp. 205–214 (1999). doi:10.1145/304182.304200

  37. Manjunath, M., Mehlhorn, K., Panagiotou, K., Sun, H.: Approximate counting of cycles in streams. In: Algorithms—ESA 2011—19th Annual European Symposium, Saarbrücken, Germany, September 5–9, 2011. Proceedings, pp. 677–688 (2011). doi:10.1007/978-3-642-23719-5_57

  38. Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55–76 (2007). doi:10.1007/s00778-006-0030-1

    Article  Google Scholar 

  39. McGregor, A., Vorotnikova, S., Vu, H.T.: Better algorithms for counting triangles in data streams. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26–July 01, 2016, pp. 401–411 (2016). doi:10.1145/2902251.2902283

  40. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)

    Article  Google Scholar 

  41. Motwani, R., Panigrahy, R., Xu, Y.: Estimating sum by weighted sampling. In: Automata, Languages and Programming, 34th International Colloquium, ICALP 2007, Wroclaw, Poland, July 9–13, 2007, Proceedings, pp. 53–64 (2007). doi:10.1007/978-3-540-73420-8_7

  42. Motwani, R., Raghavan, P.: Randomized Algorithms. Chapman & Hall/CRC, London (2010)

    MATH  Google Scholar 

  43. Nguyen, H.N., Onak, K.: Constant-time approximation algorithms via local improvements. In: 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25–28, 2008, Philadelphia, PA, USA, pp. 327–336 (2008). doi:10.1109/FOCS.2008.81

  44. Onak, K., Ron, D., Rosen, M., Rubinfeld, R.: A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17–19, 2012, pp. 1123–1131 (2012).http://portal.acm.org/citation.cfm?id=2095204&CFID=63838676&CFTOKEN=79617016

  45. Parnas, M., Ron, D.: Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci. 381(1–3), 183–196 (2007). doi:10.1016/j.tcs.2007.04.040

    Article  MathSciNet  MATH  Google Scholar 

  46. Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece, pp. 486–495 (1997). http://www.vldb.org/conf/1997/P486.PDF

  47. Przulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515 (2004). doi:10.1093/bioinformatics/bth436

    Article  Google Scholar 

  48. Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. J. Comput. Biol. 13(2), 133–144 (2006). doi:10.1089/cmb.2006.13.133

    Article  MathSciNet  MATH  Google Scholar 

  49. Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinform. 7, 199 (2006). doi:10.1186/1471-2105-7-199

    Article  Google Scholar 

  50. Swami, A.N., Schiefer, K.B.: On the estimation of join result sizes. In: Advances in Database Technology—EDBT’94. 4th International Conference on Extending Database Technology, Cambridge, United Kingdom, March 28–31, 1994, Proceedings, pp. 287–300 (1994). doi:10.1007/3-540-57818-8_58

  51. Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biology Bioinform. 3(4), 347–359 (2006). doi:10.1109/TCBB.2006.51

    Article  Google Scholar 

  52. Williams, R.: Finding paths of length k in \(\text{ o }{}^{*} (2^{\text{ k }})\) time. Inf. Process. Lett. 109(6), 315–318 (2009). doi:10.1016/j.ipl.2008.11.004

    Article  MathSciNet  MATH  Google Scholar 

  53. Williams, V.V., Williams, R.: Finding, minimizing, and counting weighted subgraphs. SIAM J. Comput. 42(3), 831–854 (2013). doi:10.1137/09076619X

    Article  MathSciNet  MATH  Google Scholar 

  54. Yoshida, Y., Yamamoto, M., Ito, H.: Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM J. Comput. 41(4), 1074–1093 (2012). doi:10.1137/110828691

    Article  MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

Aliakbarpour, Gouleakis, Peebles, Rubinfeld and Yodpinyanee were supported by the National Science Foundation Graduate Research Fellowship under Grant No. CCF-1217423, CCF-1065125 and CCF-1420692. Peebles was also supported by Grant No. CCF-1122374. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. In addition, Rubinfeld was supported by the Israel Science Foundation grant 1536/14, and Yodpinyanee was supported by the Development and Promotion of Science and Technology Talents Project scholarship, Royal Thai Government. We thank Dana Ron for her valuable contribution to this paper. We thank Peter Haas and Samuel Madden for helpful discussions. We thank anonymous reviewers for their insightful comments on the preliminary version of this paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Anak Yodpinyanee.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix: Useful Inequalities

Appendix: Useful Inequalities

This section provides standard equalities that we use throughout our paper. These inequalities exist in many variations, but here we only present the formulations which are most convenient for our purposes.

Theorem 10

(Chebyshev’s Inequality) For any random variable X and \(a > 0\),

$$\begin{aligned} \mathrm{P}[|X - \mathrm{E}[X]| \ge a] \le \frac{{\text{ Var }}[X]}{a^2}. \end{aligned}$$

Theorem 11

(Markov’s Inequality) For any non-negative random variable X and \(a > 0\),

$$\begin{aligned} \mathrm{P}[X \ge a] \le \frac{\mathrm{E}[X]}{a}. \end{aligned}$$

Theorem 12

(Chernoff Bound) Let \(X_1, \ldots , X_n\) be independent Bernoulli random variables such that \(\mathrm{P}[X_i = 1] = p\) for all \(i \in [n]\), and let \(X = \frac{1}{n}\sum _{i=1}^{n} X_i\). Then for any \(0 < \delta \le 1\),

$$\begin{aligned} \mathrm{P}[X< (1-\delta ) p] < e^{-\delta ^2pn/2}. \end{aligned}$$

Theorem 13

(Jensen’s Inequality) For any real convex function f with \(x_1, \ldots , x_n\) in its domain,

$$\begin{aligned} \sum _{i=1}^n f(x_i) \ge n f\left( \sum _{i=1}^n x_i\right) . \end{aligned}$$

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Aliakbarpour, M., Biswas, A.S., Gouleakis, T. et al. Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling. Algorithmica 80, 668–697 (2018). https://doi.org/10.1007/s00453-017-0287-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-017-0287-3

Keywords

Navigation