Better Size Estimation for Sparse Matrix Products

Abstract

We consider the problem of quickly and reliably estimating the number z of non-zero entries in a sparse boolean matrix product. This problem has applications in databases and computer algebra.

Let n denote the total number of non-zero entries in the input matrices. We show how to compute a 1±ε approximation of z (with small probability of error) in expected time \(O(n)\) for any \(\varepsilon > 4/\sqrt[4]{z}\). The previously best estimation algorithm, due to Cohen (J. Comput. Syst. Sci. 55(3):441–453, 1997), uses time \(O(n/\varepsilon^{2})\). We also present a variant using \(O(\mathrm{sort}(n))\) I/Os in expectation in the cache-oblivious model.

In contrast to these results, the currently best algorithms for computing a sparse boolean matrix product use time \(\omega(n^{4/3})\) (resp. \(\omega(n^{4/3}/B)\) I/Os), even if the result matrix is restricted to non-zero entries.

Our algorithm combines the size estimation technique of Bar-Yossef et al. (Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10, 2002) with a particular class of pairwise independent hash functions that allows the sketch of a set of the form \(A \times C\) to be computed in expected time \(O(|A| + |C|)\) and \(O(\mathrm{sort}(|A| + |C|))\) I/Os.
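
The sketch below is our own illustration (not the paper's implementation) of the two ingredients named above: a k-minimum-values distinct-count estimator in the style of Bar-Yossef et al., applied to the pairs of a Cartesian product \(A \times C\) using a pairwise independent linear hash \(h(a,c) = (ua + vc + w) \bmod p\). All names (make_pair_hash, kmv_sketch, estimate_distinct) and parameter choices are ours. For clarity the code enumerates every pair of \(A \times C\); the role of the hash family in the paper is precisely to avoid this, so that the k smallest hash values can be found in expected time \(O(|A| + |C|)\).

```python
# Illustrative sketch only (our own code, not the paper's implementation):
# a k-minimum-values (KMV) distinct-count estimator in the style of
# Bar-Yossef et al., applied to the pairs of a Cartesian product A x C
# with a pairwise independent linear hash h(a, c) = (u*a + v*c + w) mod p.
# For clarity it enumerates all |A|*|C| pairs; the paper's hash family is
# chosen so that the k smallest hash values over A x C can be found in
# expected time O(|A| + |C|) without this enumeration.

import heapq
import random

P = (1 << 61) - 1  # large prime used as the hash range


def make_pair_hash():
    """Draw one pairwise independent hash function on pairs (a, c)."""
    u = random.randrange(1, P)
    v = random.randrange(1, P)
    w = random.randrange(P)
    return lambda a, c: (u * a + v * c + w) % P


def kmv_sketch(pairs, h, k):
    """Return the k smallest distinct hash values of the given pairs, sorted."""
    return heapq.nsmallest(k, {h(a, c) for (a, c) in pairs})


def estimate_distinct(sketch, k):
    """KMV estimate of the number of distinct hashed items."""
    if len(sketch) < k:       # fewer than k distinct items: the count is exact
        return len(sketch)
    v_k = sketch[-1]          # k-th smallest hash value
    return k * P / v_k


if __name__ == "__main__":
    A, C = range(1000), range(1000)
    h = make_pair_hash()
    k = 256                   # roughly Theta(1/eps^2) minima
    sk = kmv_sketch(((a, c) for a in A for c in C), h, k)
    print(estimate_distinct(sk, k))   # should be close to |A x C| = 10**6
```

With k = Θ(1/ε²) minima the estimate is within 1±ε of the true count with constant probability, which can be boosted by taking the median of independent repetitions.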

We then describe how sampling can be used to maintain (independent) sketches of matrices that allow estimation to be performed in time o(n) if z is sufficiently large. This gives a simpler alternative to the sketching technique of Ganguly et al. (Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270, 2005), and matches a space lower bound shown in that paper.

Finally, we present experiments on real-world data sets that show the accuracy of both our methods to be significantly better than the worst-case analysis predicts.

Notes

  1. Readers familiar with the database literature may notice that we consider projections that return a set, i.e., that projection is duplicate eliminating. We also observe that any equi-join followed by a projection can be reduced to the case above, having two variables in each relation and projecting away the single join attribute. Thus, there is no loss of generality in considering this minimal case. A small worked example of this reduction appears after these notes.

  2. We observe that this is different from the “composable hash functions” used by Ganguly et al. [11].

  3. http://fimi.cs.helsinki.fi.

  4. http://www.stat.fsu.edu/pub/diehard/.
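
As a concrete illustration of the reduction described in note 1, the following minimal Python sketch (our own example; the relations R and S and all values are made up) checks that the duplicate-eliminating projection of an equi-join on its two outer attributes has exactly as many tuples as there are non-zero entries in the corresponding boolean matrix product.

```python
# Minimal illustration of the reduction in note 1 (made-up relation data):
# the duplicate-eliminating projection pi_{a,c}(R(a,b) JOIN S(b,c)) has
# exactly as many tuples as there are non-zero entries z in the product of
# the boolean matrices encoding R and S.

from collections import defaultdict

# Hypothetical relations, stored as sets of tuples.
R = {(1, 10), (1, 20), (2, 20), (3, 30)}          # R(a, b)
S = {(10, 100), (20, 100), (20, 200), (40, 300)}  # S(b, c)

# Duplicate-eliminating projection of the equi-join on (a, c).
join_project = {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

# The same count via the boolean matrix product: group both relations by the
# join attribute b and take the union of the resulting Cartesian products.
rows_by_b = defaultdict(set)   # b -> {a : (a, b) in R}
cols_by_b = defaultdict(set)   # b -> {c : (b, c) in S}
for a, b in R:
    rows_by_b[b].add(a)
for b, c in S:
    cols_by_b[b].add(c)

nonzero = set()
for b, rows in rows_by_b.items():
    for a in rows:
        for c in cols_by_b.get(b, ()):
            nonzero.add((a, c))

assert join_project == nonzero
print(len(nonzero))   # z = 4: (1,100), (1,200), (2,100), (2,200)
```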

References

  1. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. ACM, New York (1999). SIGMOD Rec. 28(2), 275–286

  2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases (VLDB ’94), pp. 487–499. Morgan Kaufmann, San Mateo (1994)

  3. Alon, N., Spencer, J.: The Probabilistic Method. Wiley, New York (1992). ISBN 0-471-53588-5

  4. Amossen, R.R., Pagh, R.: Faster join-projects and sparse matrix multiplications. In: Proceedings of the 12th International Conference on Database Theory (ICDT ’09), pp. 121–126. ACM, New York (2009)

  5. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10. Springer, Berlin (2002)

  6. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proceedings of the 19th ACM Symposium on Principles of Database Systems (PODS ’00), pp. 268–279. ACM, New York (2000)

  7. Cohen, E.: Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55(3), 441–453 (1997)

  8. Cohen, E.: Structure prediction and computation of sparse matrix products. J. Comb. Optim. 2(4), 307–332 (1998)

  9. Dor, D., Zwick, U.: Selecting the median. SIAM J. Comput. 28(5), 1722–1758 (1999)

  10. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. J. Assoc. Comput. Mach. 31(3), 538–544 (1984)

  11. Ganguly, S., Garofalakis, M., Kumar, A., Rastogi, R.: Join-distinct aggregate estimation over update streams. In: Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270. ACM, New York (2005)

  12. Ganguly, S., Saha, B.: On estimating path aggregates over streaming graphs. In: Proceedings of 17th International Symposium on Algorithms and Computation (ISAAC ’06). Lecture Notes in Computer Science, vol. 4288, pp. 163–172. Springer, Berlin (2006)

  13. Gibbons, P.B.: Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB ’01), pp. 541–550. Morgan Kaufmann, San Mateo (2001). http://www.vldb.org/conf/2001/P541.pdf

  14. Hoare, C.A.R.: Algorithm 65: find. Commun. ACM 4(7), 321–322 (1961)

  15. Lingas, A.: A fast output-sensitive algorithm for boolean matrix multiplication. Algorithmica 61(1), 36–50 (2011)

  16. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995). ISBN 0-521-47465-5

  17. Yuster, R., Zwick, U.: Fast sparse matrix multiplication. ACM Trans. Algorithms 1(1), 2–13 (2005). doi:10.1145/1077464.1077466


Acknowledgements

We would like to thank Jelani Nelson for useful discussions, and in particular for introducing us to the idea of buffering to achieve faster data stream algorithms. Also, we thank Sumit Ganguly for clarifying the lower bound proof of [11] to us. Finally, we thank Konstantin Kutzkov and Rolf Fagerberg for pointing out mistakes that have been corrected in this version of the paper.

Author information

Correspondence to Andrea Campagna.

Additional information

This work was supported by the Danish National Research Foundation, as part of the project “Scalable Query Evaluation in Relational Database Systems”.

Cite this article

Amossen, R.R., Campagna, A. & Pagh, R. Better Size Estimation for Sparse Matrix Products. Algorithmica 69, 741–757 (2014). https://doi.org/10.1007/s00453-012-9692-9
