Abstract
We consider the problem of fast and reliable estimation of the number z of non-zero entries in a sparse boolean matrix product. This problem has applications in databases and computer algebra.
Let n denote the total number of non-zero entries in the input matrices. We show how to compute a \(1\pm\varepsilon\) approximation of z (with small probability of error) in expected time for any \(\varepsilon> 4/\sqrt[4]{z}\). The previously best estimation algorithm, due to Cohen (J. Comput. Syst. Sci. 55(3):441–453, 1997), uses time . We also present a variant using I/Os in expectation in the cache-oblivious model.
In contrast to these results, the currently best algorithms for computing a sparse boolean matrix product use time \(\omega(n^{4/3})\) (resp. \(\omega(n^{4/3}/B)\) I/Os), even if the result matrix is restricted to nonzero entries.
Our algorithm combines the size estimation technique of Bar-Yossef et al. (Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10, 2002) with a particular class of pairwise independent hash functions that allows the sketch of a set of the form to be computed in expected time and I/Os.
We then describe how sampling can be used to maintain (independent) sketches of matrices that allow estimation to be performed in time o(n) if z is sufficiently large. This gives a simpler alternative to the sketching technique of Ganguly et al. (Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270, 2005), and matches a space lower bound shown in that paper.
Finally, we present experiments on real-world data sets that show the accuracy of both our methods to be significantly better than the worst-case analysis predicts.
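The size-estimation idea in the abstract can be illustrated with a small, deliberately naive sketch. It keeps the k smallest hash values of the output positions (i, c) and returns k divided by the k-th smallest value, in the spirit of the distinct-elements estimator of Bar-Yossef et al. The function name `estimate_nnz`, the input layout (`a_cols`, `b_rows`), the buffer size of 2k, and the hash form (h1(i) - h2(c)) mod 1 with fully random h1 and h2 are assumptions made for illustration. The loop enumerates every contributing pair, so it does not achieve the running time claimed in the paper; it only shows what is being estimated.

```python
import heapq
import random


def estimate_nnz(a_cols, b_rows, k=64, seed=0):
    """Estimate the number of distinct output positions (i, c).

    a_cols[j] lists the rows i with A[i][j] = 1.
    b_rows[j] lists the columns c with B[j][c] = 1.
    Naive illustration: every contributing pair is enumerated.
    """
    rng = random.Random(seed)
    h1, h2 = {}, {}  # lazily drawn random values in [0, 1)

    def draw(table, key):
        if key not in table:
            table[key] = rng.random()
        return table[key]

    def h(i, c):
        # Assumed hash form; equal pairs always get equal values.
        return (draw(h1, i) - draw(h2, c)) % 1.0

    sketch = set()    # always contains the k smallest values seen so far
    threshold = 1.0   # values at or above this can never be among them
    for j in a_cols.keys() & b_rows.keys():
        for i in a_cols[j]:
            for c in b_rows[j]:
                v = h(i, c)
                if v < threshold:
                    sketch.add(v)
                    if len(sketch) > 2 * k:
                        # Buffer is full: prune back to the k smallest.
                        sketch = set(heapq.nsmallest(k, sketch))
                        threshold = max(sketch)

    if len(sketch) < k:
        # Fewer than k distinct values were seen, so the count is exact
        # (ignoring floating-point collisions).
        return float(len(sketch))
    kth_smallest = heapq.nsmallest(k, sketch)[-1]
    return k / kth_smallest
```

Larger k lowers the variance of the estimate at the cost of a larger sketch. When fewer than k distinct positions exist, the function falls back to the exact count.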
Notes
Readers familiar with the database literature may notice that we consider projections that return a set, i.e., that projection is duplicate eliminating. We also observe that any equi-join followed by a projection can be reduced to the case above, having two variables in each relation and projecting away the single join attribute. Thus, there is no loss of generality in considering this minimal case.
We observe that this is different from the “composable hash functions” used by Ganguly et al. [11].
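The correspondence in the first note can be checked on a toy instance. The sketch below assumes relations R(a, b) and S(b, c) given as lists of tuples, and dense 0/1 matrices A and B with A[a][b] = 1 exactly when (a, b) is in R and B[b][c] = 1 exactly when (b, c) is in S. The function names are hypothetical and both functions compute exact counts by brute force.

```python
from collections import defaultdict


def join_project_size(r, s):
    """Size of the duplicate-eliminating projection of R join S onto (a, c)."""
    s_by_b = defaultdict(set)
    for b, c in s:
        s_by_b[b].add(c)
    # Collect distinct (a, c) pairs that share some join value b.
    return len({(a, c) for a, b in r for c in s_by_b.get(b, ())})


def boolean_product_nnz(a, b):
    """Number of non-zero entries in the boolean product of 0/1 matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return sum(
        any(a[i][j] and b[j][k] for j in range(inner))
        for i in range(rows)
        for k in range(cols)
    )


# Toy instance: R and S as tuples, A and B as the matching 0/1 matrices.
R = [(0, 0), (1, 0), (2, 1)]
S = [(0, 0), (1, 0), (1, 1)]
A = [[1, 0], [1, 0], [0, 1]]
B = [[1, 0], [1, 1]]
```

On this instance both functions return 4, matching the output pairs (0, 0), (1, 0), (2, 0) and (2, 1).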
References
Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. ACM, New York (1999). SIGMOD Rec. 28(2), 275–286
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases (VLDB ’94), pp. 487–499. Morgan Kaufmann, San Mateo (1994)
Alon, N., Spencer, J.: The Probabilistic Method. Wiley, New York (1992). ISBN 0-471-53588-5
Amossen, R.R., Pagh, R.: Faster join-projects and sparse matrix multiplications. In: Proceedings of the 12th International Conference on Database Theory (ICDT ’09), pp. 121–126. ACM, New York (2009)
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10. Springer, Berlin (2002)
Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proceedings of the 19th ACM Symposium on Principles of Database Systems (PODS ’00), pp. 268–279. ACM, New York (2000)
Cohen, E.: Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55(3), 441–453 (1997)
Cohen, E.: Structure prediction and computation of sparse matrix products. J. Comb. Optim. 2(4), 307–332 (1998)
Dor, D., Zwick, U.: Selecting the median. SIAM J. Comput. 28(5), 1722–1758 (1999)
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. J. Assoc. Comput. Mach. 31(3), 538–544 (1984)
Ganguly, S., Garofalakis, M., Kumar, A., Rastogi, R.: Join-distinct aggregate estimation over update streams. In: Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270. ACM, New York (2005)
Ganguly, S., Saha, B.: On estimating path aggregates over streaming graphs. In: Proceedings of 17th International Symposium on Algorithms and Computation (ISAAC ’06). Lecture Notes in Computer Science, vol. 4288, pp. 163–172. Springer, Berlin (2006)
Gibbons, P.B.: Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB ’01), pp. 541–550. Morgan Kaufmann, San Mateo (2001). http://www.vldb.org/conf/2001/P541.pdf
Hoare, C.A.R.: Algorithm 65: find. Commun. ACM 4(7), 321–322 (1961)
Lingas, A.: A fast output-sensitive algorithm for boolean matrix multiplication. Algorithmica 61(1), 36–50 (2011)
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995). ISBN 0-521-47465-5
Yuster, R., Zwick, U.: Fast sparse matrix multiplication. ACM Trans. Algorithms 1(1), 2–13 (2005). doi:10.1145/1077464.1077466
Acknowledgements
We would like to thank Jelani Nelson for useful discussions, and in particular for introducing us to the idea of buffering to achieve faster data stream algorithms. Also, we thank Sumit Ganguly for clarifying the lower bound proof of [11] to us. Finally, we thank Konstantin Kutzkov and Rolf Fagerberg for pointing out mistakes that have been corrected in this version of the paper.
Additional information
This work was supported by the Danish National Research Foundation, as part of the project “Scalable Query Evaluation in Relational Database Systems”.
About this article
Cite this article
Amossen, R.R., Campagna, A. & Pagh, R. Better Size Estimation for Sparse Matrix Products. Algorithmica 69, 741–757 (2014). https://doi.org/10.1007/s00453-012-9692-9