Advertisement

Algorithmica

, Volume 80, Issue 2, pp 668–697 | Cite as

Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling

  • Maryam Aliakbarpour
  • Amartya Shankha Biswas
  • Themis Gouleakis
  • John Peebles
  • Ronitt Rubinfeld
  • Anak Yodpinyanee
Article
  • 384 Downloads

Abstract

We study the problem of estimating the value of sums of the form \(S_p \triangleq \sum \left( {\begin{array}{c}x_i\\ p\end{array}}\right) \) when one has the ability to sample \(x_i \ge 0\) with probability proportional to its magnitude. When \(p=2\), this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when \(\{x_i\}\) is the degree sequence of a graph, which corresponds to counting the number of p-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a \((1 \pm \varepsilon )\)-multiplicative approximation of \(S_p\) has query and time complexities \(\mathrm{O}\left( \frac{m \log \log n}{\epsilon ^2 S_p^{1/p}}\right) \). Here, \(m=\sum x_i/2\) is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when \(\{x_i\}\) is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where \(S_p\le n\), and \(p=2\), our upper bound is \(\tilde{O}(n/S_p^{1/2})\), in contrast to their \(\varOmega (n/S_p^{1/3})\) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.

Keywords

Subgraphs Approximate counting Randomized algorithms Sublinear-time algorithms 

Notes

Acknowledgements

Aliakbarpour, Gouleakis, Peebles, Rubinfeld and Yodpinyanee were supported by the National Science Foundation Graduate Research Fellowship under Grant No. CCF-1217423, CCF-1065125 and CCF-1420692. Peebles was also supported by Grant No. CCF-1122374. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. In addition, Rubinfeld was supported by the Israel Science Foundation grant 1536/14, and Yodpinyanee was supported by the Development and Promotion of Science and Technology Talents Project scholarship, Royal Thai Government. We thank Dana Ron for her valuable contribution to this paper. We thank Peter Haas and Samuel Madden for helpful discussions. We thank anonymous reviewers for their insightful comments on the preliminary version of this paper.

Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.

References

  1. 1.
    Ahn, K.J., Guha, S., McGregor, A.: Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp. 5–14 (2012). doi: 10.1145/2213556.2213560
  2. 2.
    Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. In: Proceedings 16th International Conference on Intelligent Systems for Molecular Biology (ISMB), Toronto, Canada, July 19–23, 2008, pp. 241–249 (2008). doi: 10.1093/bioinformatics/btn163
  3. 3.
    Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64(3), 719–747 (2002). doi: 10.1006/jcss.2001.1813 MathSciNetCrossRefzbMATHGoogle Scholar
  4. 4.
    Alon, N., Gutner, S.: Balanced hashing, color coding and approximate counting. In: Parameterized and Exact Computation, 4th International Workshop, IWPEC 2009, Copenhagen, Denmark, September 10–11, 2009, Revised Selected Papers, pp. 1–16 (2009). doi: 10.1007/978-3-642-11269-0_1
  5. 5.
    Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi: 10.1006/jcss.1997.1545 MathSciNetCrossRefzbMATHGoogle Scholar
  6. 6.
    Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209–223 (1997). doi: 10.1007/BF02523189 MathSciNetCrossRefzbMATHGoogle Scholar
  7. 7.
    Amini, O., Fomin, F.V., Saurabh, S.: Counting subgraphs via homomorphisms. SIAM J. Discrete Math. 26(2), 695–717 (2012). doi: 10.1137/100789403 MathSciNetCrossRefzbMATHGoogle Scholar
  8. 8.
    Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6–8, 2002, San Francisco, CA, USA., pp. 623–632 (2002). http://dl.acm.org/citation.cfm?id=545381.545464
  9. 9.
    Batu, T., Berenbrink, P., Sohler, C.: A sublinear-time approximation scheme for bin packing. Theor. Comput. Sci. 410(47–49), 5082–5092 (2009). doi: 10.1016/j.tcs.2009.08.006 MathSciNetCrossRefzbMATHGoogle Scholar
  10. 10.
    Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, pp. 16–24 (2008). doi: 10.1145/1401890.1401898
  11. 11.
    Bhuvanagiri, L., Ganguly, S., Kesh, D., Saha, C.: Simpler algorithm for estimating frequency moments of data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 708–713. ACM (2006)Google Scholar
  12. 12.
    Blais, E., Brody, J., Matulef, K.: Property testing lower bounds via communication complexity. Comput. Complex. 21(2), 311–358 (2012). doi: 10.1007/s00037-012-0040-x MathSciNetCrossRefzbMATHGoogle Scholar
  13. 13.
    Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26–28, 2006, Chicago, Illinois, USA, pp. 253–262 (2006). doi: 10.1145/1142351.1142388
  14. 14.
    Canonne, C.L., Rubinfeld, R.: Testing probability distributions underlying aggregated data. In: Automata, Languages, and Programming—41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8–11, 2014, Proceedings, Part I, pp. 283–295 (2014). doi: 10.1007/978-3-662-43948-7_24
  15. 15.
    Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11–14, 2004, pp. 151–156 (2004). http://dl.acm.org/citation.cfm?id=982792.982815
  16. 16.
    Duke, R.A., Lefmann, H., Rödl, V.: A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM J. Comput. 24(3), 598–620 (1995). doi: 10.1137/S0097539793247634 MathSciNetCrossRefzbMATHGoogle Scholar
  17. 17.
    Eden, T., Levi, A., Ron, D., Seshadhri, C.: Approximately counting triangles in sublinear time. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17–20 October, 2015, pp. 614–633 (2015). doi: 10.1109/FOCS.2015.44
  18. 18.
    Feige, U.: On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. 35(4), 964–984 (2006). doi: 10.1137/S0097539704447304 MathSciNetCrossRefzbMATHGoogle Scholar
  19. 19.
    Flum, J., Grohe, M.: The parameterized complexity of counting problems. SIAM J. Comput. 33(4), 892–922 (2004). doi: 10.1137/S0097539703427203 MathSciNetCrossRefzbMATHGoogle Scholar
  20. 20.
    Fomin, F.V., Lokshtanov, D., Raman, V., Saurabh, S., Rao, B.V.R.: Faster algorithms for finding and counting subgraphs. J. Comput. Syst. Sci. 78(3), 698–706 (2012). doi: 10.1016/j.jcss.2011.10.001 MathSciNetCrossRefzbMATHGoogle Scholar
  21. 21.
    Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21–24, 2001, pp. 461–472 (2001). doi: 10.1145/375663.375727
  22. 22.
    Goldreich, O.: On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electron. Colloq. Comput. Complex. (ECCC) 20, 73 (2013). http://eccc.hpi-web.de/report/2013/073
  23. 23.
    Goldreich, O., Ron, D.: Approximating average parameters of graphs. Random Struct. Algorithms 32(4), 473–493 (2008). doi: 10.1002/rsa.20203 MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math. 25(3), 1365–1411 (2011). doi: 10.1137/100783066 MathSciNetCrossRefzbMATHGoogle Scholar
  25. 25.
    Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Internet Math. 6(3), 349–372 (2009). doi: 10.1080/15427951.2009.10390645 MathSciNetCrossRefzbMATHGoogle Scholar
  26. 26.
    Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Research in Computational Molecular Biology, 11th Annual International Conference, RECOMB 2007, Oakland, CA, USA, April 21–25, 2007, Proceedings, pp. 92–106 (2007). doi: 10.1007/978-3-540-71681-5_7
  27. 27.
    Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223–250 (2009). doi: 10.1002/sam.10016 MathSciNetCrossRefGoogle Scholar
  28. 28.
    Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550–569 (1996). doi: 10.1006/jcss.1996.0041 MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Hales, D., Arteconi, S.: Motifs in evolving cooperative networks look like protein structure networks. NHM 3(2), 239–249 (2008). doi: 10.3934/nhm.2008.3.239 MathSciNetCrossRefzbMATHGoogle Scholar
  30. 30.
    Hassidim, A., Kelner, J.A., Nguyen, H.N., Onak, K.: Local graph partitions for approximation and testing. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25–27, 2009, Atlanta, Georgia, USA, pp. 22–31 (2009). doi: 10.1109/FOCS.2009.77
  31. 31.
    Hormozdiari, F., Berenbrink, P., Przulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Comput Biol (2007). doi: 10.1371/journal.pcbi.0030118 Google Scholar
  32. 32.
    Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22–24, 2005, pp. 202–208 (2005). doi: 10.1145/1060590.1060621
  33. 33.
    Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting arbitrary subgraphs in data streams. In: Automata, Languages, and Programming—39th International Colloquium, ICALP 2012, Warwick, UK, July 9–13, 2012, Proceedings, Part II, pp. 598–609 (2012). doi: 10.1007/978-3-642-31585-5_53
  34. 34.
    Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746–1758 (2004). doi: 10.1093/bioinformatics/bth163 CrossRefGoogle Scholar
  35. 35.
    Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 8(1–2), 161–185 (2012). doi: 10.1080/15427951.2012.625260 MathSciNetCrossRefzbMATHGoogle Scholar
  36. 36.
    Lee, J., Kim, D., Chung, C.: Multi-dimensional selectivity estimation using compressed histogram information. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA., pp. 205–214 (1999). doi: 10.1145/304182.304200
  37. 37.
    Manjunath, M., Mehlhorn, K., Panagiotou, K., Sun, H.: Approximate counting of cycles in streams. In: Algorithms—ESA 2011—19th Annual European Symposium, Saarbrücken, Germany, September 5–9, 2011. Proceedings, pp. 677–688 (2011). doi: 10.1007/978-3-642-23719-5_57
  38. 38.
    Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55–76 (2007). doi: 10.1007/s00778-006-0030-1 CrossRefGoogle Scholar
  39. 39.
    McGregor, A., Vorotnikova, S., Vu, H.T.: Better algorithms for counting triangles in data streams. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26–July 01, 2016, pp. 401–411 (2016). doi: 10.1145/2902251.2902283
  40. 40.
    Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)CrossRefGoogle Scholar
  41. 41.
    Motwani, R., Panigrahy, R., Xu, Y.: Estimating sum by weighted sampling. In: Automata, Languages and Programming, 34th International Colloquium, ICALP 2007, Wroclaw, Poland, July 9–13, 2007, Proceedings, pp. 53–64 (2007). doi: 10.1007/978-3-540-73420-8_7
  42. 42.
    Motwani, R., Raghavan, P.: Randomized Algorithms. Chapman & Hall/CRC, London (2010)zbMATHGoogle Scholar
  43. 43.
    Nguyen, H.N., Onak, K.: Constant-time approximation algorithms via local improvements. In: 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25–28, 2008, Philadelphia, PA, USA, pp. 327–336 (2008). doi: 10.1109/FOCS.2008.81
  44. 44.
    Onak, K., Ron, D., Rosen, M., Rubinfeld, R.: A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17–19, 2012, pp. 1123–1131 (2012).http://portal.acm.org/citation.cfm?id=2095204&CFID=63838676&CFTOKEN=79617016
  45. 45.
    Parnas, M., Ron, D.: Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci. 381(1–3), 183–196 (2007). doi: 10.1016/j.tcs.2007.04.040 MathSciNetCrossRefzbMATHGoogle Scholar
  46. 46.
    Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece, pp. 486–495 (1997). http://www.vldb.org/conf/1997/P486.PDF
  47. 47.
    Przulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515 (2004). doi: 10.1093/bioinformatics/bth436 CrossRefGoogle Scholar
  48. 48.
    Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. J. Comput. Biol. 13(2), 133–144 (2006). doi: 10.1089/cmb.2006.13.133 MathSciNetCrossRefzbMATHGoogle Scholar
  49. 49.
    Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinform. 7, 199 (2006). doi: 10.1186/1471-2105-7-199 CrossRefGoogle Scholar
  50. 50.
    Swami, A.N., Schiefer, K.B.: On the estimation of join result sizes. In: Advances in Database Technology—EDBT’94. 4th International Conference on Extending Database Technology, Cambridge, United Kingdom, March 28–31, 1994, Proceedings, pp. 287–300 (1994). doi: 10.1007/3-540-57818-8_58
  51. 51.
    Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biology Bioinform. 3(4), 347–359 (2006). doi: 10.1109/TCBB.2006.51 CrossRefGoogle Scholar
  52. 52.
    Williams, R.: Finding paths of length k in \(\text{ o }{}^{*} (2^{\text{ k }})\) time. Inf. Process. Lett. 109(6), 315–318 (2009). doi: 10.1016/j.ipl.2008.11.004 MathSciNetCrossRefzbMATHGoogle Scholar
  53. 53.
    Williams, V.V., Williams, R.: Finding, minimizing, and counting weighted subgraphs. SIAM J. Comput. 42(3), 831–854 (2013). doi: 10.1137/09076619X MathSciNetCrossRefzbMATHGoogle Scholar
  54. 54.
    Yoshida, Y., Yamamoto, M., Ito, H.: Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM J. Comput. 41(4), 1074–1093 (2012). doi: 10.1137/110828691 MathSciNetCrossRefzbMATHGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. 1.CSAIL, MITCambridgeUSA
  2. 2.The Blavatnik School of Computer ScienceTel Aviv UniversityTel AvivIsrael

Personalised recommendations