Abstract
We study the problem of estimating the value of sums of the form \(S_p \triangleq \sum \left( {\begin{array}{c}x_i\\ p\end{array}}\right) \) when one has the ability to sample \(x_i \ge 0\) with probability proportional to its magnitude. When \(p=2\), this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when \(\{x_i\}\) is the degree sequence of a graph, which corresponds to counting the number of p-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a \((1 \pm \varepsilon )\)-multiplicative approximation of \(S_p\) has query and time complexities \(\mathrm{O}\left( \frac{m \log \log n}{\epsilon ^2 S_p^{1/p}}\right) \). Here, \(m=\sum x_i/2\) is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, n is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when \(\{x_i\}\) is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where \(S_p\le n\), and \(p=2\), our upper bound is \(\tilde{O}(n/S_p^{1/2})\), in contrast to their \(\varOmega (n/S_p^{1/3})\) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.
Similar content being viewed by others
Notes
One useful technique for giving lower bounds on sublinear time algorithms, pioneered by [12], is to make use of a connection between lower bounds in communication complexity and lower bounds on sublinear time algorithms. More specifically, by giving a reduction from a communication complexity problem to the problem we want to solve, a lower bound on the communication complexity problem yields a lower bound on our problem. In the past, this approach has led to simpler and cleaner sublinear time lower bounds for many problems. Attempts at such an approach for reducing the set-disjointness problem in communication complexity to our estimation problem on graphs run into the following difficulties: First, as explained in [22], the straightforward reduction adds a logarithmic overhead, thereby weakening the lower bound by the same factor. Second, the reduction seems to work only in the case of sparse graphs. Although it is not clear if these difficulties are insurmountable, it seems that it will not give a simpler argument than the approach that we present in this work.
For our counting purpose, if \(x < y\) then we define \({x \atopwithdelims ()y} = 0\).
We may use the binomial coefficients \({x \atopwithdelims ()y}\) for non-integral value x in the inequalities. These can be interpreted through alternative formulations of binomial coefficients using falling factorials or analytic functions.
See e.g., [42] for more information on Yao’s principle.
To be consistent with our notation where indices begin at 1, let \(x\;\mathrm {mod}\;y=y\) when x is a multiple of y. (Otherwise, \(x\;\mathrm {mod}\;y\) still denotes the remainder of \(x \div y\).)
A permutation \(\pi \) over [n] is a bijection \(\pi : [n] \rightarrow [n]\).
References
Ahn, K.J., Guha, S., McGregor, A.: Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp. 5–14 (2012). doi:10.1145/2213556.2213560
Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. In: Proceedings 16th International Conference on Intelligent Systems for Molecular Biology (ISMB), Toronto, Canada, July 19–23, 2008, pp. 241–249 (2008). doi:10.1093/bioinformatics/btn163
Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64(3), 719–747 (2002). doi:10.1006/jcss.2001.1813
Alon, N., Gutner, S.: Balanced hashing, color coding and approximate counting. In: Parameterized and Exact Computation, 4th International Workshop, IWPEC 2009, Copenhagen, Denmark, September 10–11, 2009, Revised Selected Papers, pp. 1–16 (2009). doi:10.1007/978-3-642-11269-0_1
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi:10.1006/jcss.1997.1545
Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209–223 (1997). doi:10.1007/BF02523189
Amini, O., Fomin, F.V., Saurabh, S.: Counting subgraphs via homomorphisms. SIAM J. Discrete Math. 26(2), 695–717 (2012). doi:10.1137/100789403
Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6–8, 2002, San Francisco, CA, USA., pp. 623–632 (2002). http://dl.acm.org/citation.cfm?id=545381.545464
Batu, T., Berenbrink, P., Sohler, C.: A sublinear-time approximation scheme for bin packing. Theor. Comput. Sci. 410(47–49), 5082–5092 (2009). doi:10.1016/j.tcs.2009.08.006
Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, pp. 16–24 (2008). doi:10.1145/1401890.1401898
Bhuvanagiri, L., Ganguly, S., Kesh, D., Saha, C.: Simpler algorithm for estimating frequency moments of data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 708–713. ACM (2006)
Blais, E., Brody, J., Matulef, K.: Property testing lower bounds via communication complexity. Comput. Complex. 21(2), 311–358 (2012). doi:10.1007/s00037-012-0040-x
Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26–28, 2006, Chicago, Illinois, USA, pp. 253–262 (2006). doi:10.1145/1142351.1142388
Canonne, C.L., Rubinfeld, R.: Testing probability distributions underlying aggregated data. In: Automata, Languages, and Programming—41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8–11, 2014, Proceedings, Part I, pp. 283–295 (2014). doi:10.1007/978-3-662-43948-7_24
Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11–14, 2004, pp. 151–156 (2004). http://dl.acm.org/citation.cfm?id=982792.982815
Duke, R.A., Lefmann, H., Rödl, V.: A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM J. Comput. 24(3), 598–620 (1995). doi:10.1137/S0097539793247634
Eden, T., Levi, A., Ron, D., Seshadhri, C.: Approximately counting triangles in sublinear time. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17–20 October, 2015, pp. 614–633 (2015). doi:10.1109/FOCS.2015.44
Feige, U.: On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. 35(4), 964–984 (2006). doi:10.1137/S0097539704447304
Flum, J., Grohe, M.: The parameterized complexity of counting problems. SIAM J. Comput. 33(4), 892–922 (2004). doi:10.1137/S0097539703427203
Fomin, F.V., Lokshtanov, D., Raman, V., Saurabh, S., Rao, B.V.R.: Faster algorithms for finding and counting subgraphs. J. Comput. Syst. Sci. 78(3), 698–706 (2012). doi:10.1016/j.jcss.2011.10.001
Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21–24, 2001, pp. 461–472 (2001). doi:10.1145/375663.375727
Goldreich, O.: On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electron. Colloq. Comput. Complex. (ECCC) 20, 73 (2013). http://eccc.hpi-web.de/report/2013/073
Goldreich, O., Ron, D.: Approximating average parameters of graphs. Random Struct. Algorithms 32(4), 473–493 (2008). doi:10.1002/rsa.20203
Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math. 25(3), 1365–1411 (2011). doi:10.1137/100783066
Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Internet Math. 6(3), 349–372 (2009). doi:10.1080/15427951.2009.10390645
Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Research in Computational Molecular Biology, 11th Annual International Conference, RECOMB 2007, Oakland, CA, USA, April 21–25, 2007, Proceedings, pp. 92–106 (2007). doi:10.1007/978-3-540-71681-5_7
Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223–250 (2009). doi:10.1002/sam.10016
Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550–569 (1996). doi:10.1006/jcss.1996.0041
Hales, D., Arteconi, S.: Motifs in evolving cooperative networks look like protein structure networks. NHM 3(2), 239–249 (2008). doi:10.3934/nhm.2008.3.239
Hassidim, A., Kelner, J.A., Nguyen, H.N., Onak, K.: Local graph partitions for approximation and testing. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25–27, 2009, Atlanta, Georgia, USA, pp. 22–31 (2009). doi:10.1109/FOCS.2009.77
Hormozdiari, F., Berenbrink, P., Przulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Comput Biol (2007). doi:10.1371/journal.pcbi.0030118
Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22–24, 2005, pp. 202–208 (2005). doi:10.1145/1060590.1060621
Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting arbitrary subgraphs in data streams. In: Automata, Languages, and Programming—39th International Colloquium, ICALP 2012, Warwick, UK, July 9–13, 2012, Proceedings, Part II, pp. 598–609 (2012). doi:10.1007/978-3-642-31585-5_53
Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746–1758 (2004). doi:10.1093/bioinformatics/bth163
Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 8(1–2), 161–185 (2012). doi:10.1080/15427951.2012.625260
Lee, J., Kim, D., Chung, C.: Multi-dimensional selectivity estimation using compressed histogram information. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA., pp. 205–214 (1999). doi:10.1145/304182.304200
Manjunath, M., Mehlhorn, K., Panagiotou, K., Sun, H.: Approximate counting of cycles in streams. In: Algorithms—ESA 2011—19th Annual European Symposium, Saarbrücken, Germany, September 5–9, 2011. Proceedings, pp. 677–688 (2011). doi:10.1007/978-3-642-23719-5_57
Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55–76 (2007). doi:10.1007/s00778-006-0030-1
McGregor, A., Vorotnikova, S., Vu, H.T.: Better algorithms for counting triangles in data streams. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26–July 01, 2016, pp. 401–411 (2016). doi:10.1145/2902251.2902283
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
Motwani, R., Panigrahy, R., Xu, Y.: Estimating sum by weighted sampling. In: Automata, Languages and Programming, 34th International Colloquium, ICALP 2007, Wroclaw, Poland, July 9–13, 2007, Proceedings, pp. 53–64 (2007). doi:10.1007/978-3-540-73420-8_7
Motwani, R., Raghavan, P.: Randomized Algorithms. Chapman & Hall/CRC, London (2010)
Nguyen, H.N., Onak, K.: Constant-time approximation algorithms via local improvements. In: 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25–28, 2008, Philadelphia, PA, USA, pp. 327–336 (2008). doi:10.1109/FOCS.2008.81
Onak, K., Ron, D., Rosen, M., Rubinfeld, R.: A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17–19, 2012, pp. 1123–1131 (2012).http://portal.acm.org/citation.cfm?id=2095204&CFID=63838676&CFTOKEN=79617016
Parnas, M., Ron, D.: Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci. 381(1–3), 183–196 (2007). doi:10.1016/j.tcs.2007.04.040
Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece, pp. 486–495 (1997). http://www.vldb.org/conf/1997/P486.PDF
Przulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515 (2004). doi:10.1093/bioinformatics/bth436
Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. J. Comput. Biol. 13(2), 133–144 (2006). doi:10.1089/cmb.2006.13.133
Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinform. 7, 199 (2006). doi:10.1186/1471-2105-7-199
Swami, A.N., Schiefer, K.B.: On the estimation of join result sizes. In: Advances in Database Technology—EDBT’94. 4th International Conference on Extending Database Technology, Cambridge, United Kingdom, March 28–31, 1994, Proceedings, pp. 287–300 (1994). doi:10.1007/3-540-57818-8_58
Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biology Bioinform. 3(4), 347–359 (2006). doi:10.1109/TCBB.2006.51
Williams, R.: Finding paths of length k in \(\text{ o }{}^{*} (2^{\text{ k }})\) time. Inf. Process. Lett. 109(6), 315–318 (2009). doi:10.1016/j.ipl.2008.11.004
Williams, V.V., Williams, R.: Finding, minimizing, and counting weighted subgraphs. SIAM J. Comput. 42(3), 831–854 (2013). doi:10.1137/09076619X
Yoshida, Y., Yamamoto, M., Ito, H.: Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM J. Comput. 41(4), 1074–1093 (2012). doi:10.1137/110828691
Acknowledgements
Aliakbarpour, Gouleakis, Peebles, Rubinfeld and Yodpinyanee were supported by the National Science Foundation Graduate Research Fellowship under Grant No. CCF-1217423, CCF-1065125 and CCF-1420692. Peebles was also supported by Grant No. CCF-1122374. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. In addition, Rubinfeld was supported by the Israel Science Foundation grant 1536/14, and Yodpinyanee was supported by the Development and Promotion of Science and Technology Talents Project scholarship, Royal Thai Government. We thank Dana Ron for her valuable contribution to this paper. We thank Peter Haas and Samuel Madden for helpful discussions. We thank anonymous reviewers for their insightful comments on the preliminary version of this paper.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Useful Inequalities
Appendix: Useful Inequalities
This section provides standard equalities that we use throughout our paper. These inequalities exist in many variations, but here we only present the formulations which are most convenient for our purposes.
Theorem 10
(Chebyshev’s Inequality) For any random variable X and \(a > 0\),
Theorem 11
(Markov’s Inequality) For any non-negative random variable X and \(a > 0\),
Theorem 12
(Chernoff Bound) Let \(X_1, \ldots , X_n\) be independent Bernoulli random variables such that \(\mathrm{P}[X_i = 1] = p\) for all \(i \in [n]\), and let \(X = \frac{1}{n}\sum _{i=1}^{n} X_i\). Then for any \(0 < \delta \le 1\),
Theorem 13
(Jensen’s Inequality) For any real convex function f with \(x_1, \ldots , x_n\) in its domain,
Rights and permissions
About this article
Cite this article
Aliakbarpour, M., Biswas, A.S., Gouleakis, T. et al. Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling. Algorithmica 80, 668–697 (2018). https://doi.org/10.1007/s00453-017-0287-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-017-0287-3