Better Size Estimation for Sparse Matrix Products

Amossen, Rasmus Resen; Campagna, Andrea; Pagh, Rasmus

doi:10.1007/s00453-012-9692-9

Published: 01 March 2013

Volume 69, pages 741–757 (2014)

Algorithmica

Abstract

We consider the problem of doing fast and reliable estimation of the number z of non-zero entries in a sparse boolean matrix product. This problem has applications in databases and computer algebra.

Let n denote the total number of non-zero entries in the input matrices. We show how to compute a 1±ε approximation of z (with small probability of error) in expected time \(O(n)\) for any \(\varepsilon > 4/\sqrt[4]{z}\). The previously best estimation algorithm, due to Cohen (J. Comput. Syst. Sci. 55(3):441–453, 1997), uses time \(O(n/\varepsilon^{2})\). We also present a variant using \(O(\mathrm{sort}(n))\) I/Os in expectation in the cache-oblivious model.
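As a quick worked instance of the accuracy condition \(\varepsilon > 4/\sqrt[4]{z}\) (the values of z below are chosen purely for illustration and do not come from the paper):

```latex
\[
  z = 10^{8}
  \;\Longrightarrow\;
  \sqrt[4]{z} = 10^{2}
  \;\Longrightarrow\;
  \frac{4}{\sqrt[4]{z}} = 0.04,
  \qquad
  z = 10^{4}
  \;\Longrightarrow\;
  \frac{4}{\sqrt[4]{z}} = 0.4 .
\]
```

So for a product with a hundred million non-zero entries any relative error above 4% is admissible, while for a product with only ten thousand non-zero entries the condition already requires ε above 40%.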

In contrast to these results, the currently best algorithms for computing a sparse boolean matrix product use time \(\omega(n^{4/3})\) (resp. \(\omega(n^{4/3}/B)\) I/Os), even if the result matrix is restricted to \(z = O(n)\) nonzero entries.

Our algorithm combines the size estimation technique of Bar-Yossef et al. (Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10, 2002) with a particular class of pairwise independent hash functions that allows the sketch of a set of the form \(A \times C\) to be computed in expected time \(O(|A|+|C|)\) and \(O(\mathrm{sort}(|A|+|C|))\) I/Os.
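The general idea of a k-minimum-values distinct-element sketch driven by a pairwise independent hash on output coordinates can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: it enumerates every output pair explicitly and so does not achieve the fast running time described above. The function names, the modulus `P`, and the way the two linear hashes are combined by subtraction are assumptions made for this illustration.

```python
import heapq
import random
from collections import defaultdict

P = (1 << 61) - 1  # prime modulus; indices are assumed to be smaller than P


def random_linear_hash(rng):
    """Draw h(x) = (a*x + b) mod P from a pairwise independent family."""
    a, b = rng.randrange(P), rng.randrange(P)
    return lambda x: (a * x + b) % P


def estimate_size(A, C, k, seed=0):
    """Estimate the number of non-zero entries of the boolean product A*C.

    A holds the non-zero coordinates (i, j) of the left matrix and C the
    non-zero coordinates (j, c) of the right matrix. Naive illustration:
    every output pair (i, c) is hashed, so the cost is not linear in the input.
    """
    rng = random.Random(seed)
    h1, h2 = random_linear_hash(rng), random_linear_hash(rng)

    # Group the right matrix by its row index j.
    cols_by_row = defaultdict(set)
    for j, c in C:
        cols_by_row[j].add(c)

    # Hash each output coordinate (i, c); duplicates collapse in the set.
    # A real sketch would retain only the k smallest values seen so far.
    values = set()
    for i, j in A:
        for c in cols_by_row.get(j, ()):
            values.add((h1(i) - h2(c)) % P)

    if len(values) < k:
        # Fewer than k distinct hash values: report the count directly.
        return float(len(values))

    # k-th smallest hash value, viewed as a fraction of the range [0, P).
    v = heapq.nsmallest(k, values)[-1]
    return k * P / max(v, 1)
```

Calling `estimate_size(A, C, k)` with a larger `k` trades memory for accuracy; when the product has fewer than `k` non-zero entries the function simply counts the distinct hash values it saw.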

We then describe how sampling can be used to maintain (independent) sketches of matrices that allow estimation to be performed in time o(n) if z is sufficiently large. This gives a simpler alternative to the sketching technique of Ganguly et al. (Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270, 2005), and matches a space lower bound shown in that paper.

Finally, we present experiments on real-world data sets that show the accuracy of both our methods to be significantly better than the worst-case analysis predicts.

Notes

Readers familiar with the database literature may notice that we consider projections that return a set, i.e., that projection is duplicate eliminating. We also observe that any equi-join followed by a projection can be reduced to the case above, having two variables in each relation and projecting away the single join attribute. Thus, there is no loss of generality in considering this minimal case.
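The correspondence between a duplicate-eliminating join-project and the non-zero entries of a boolean matrix product can be made concrete with a small Python sketch. The relation contents and the function names `join_size` and `join_project_size` are hypothetical and serve only as an illustration; the code computes the sizes exactly rather than estimating them.

```python
def join_size(R1, R2):
    """Number of tuples (a, b, c) in the equi-join of R1(a, b) and R2(b, c)."""
    return sum(1 for a, b in R1 for b2, c in R2 if b == b2)


def join_project_size(R1, R2):
    """Number of distinct (a, c) pairs after projecting away the join attribute b.

    Viewing R1 and R2 as the non-zero coordinates of two boolean matrices,
    this is the number of non-zero entries of their product.
    """
    c_by_b = {}
    for b, c in R2:
        c_by_b.setdefault(b, set()).add(c)
    return len({(a, c) for a, b in R1 for c in c_by_b.get(b, ())})


# Hypothetical relations: who takes which course, and on which day it runs.
takes = [("ann", "db"), ("bob", "db"), ("bob", "algo")]
runs_on = [("db", "mon"), ("db", "tue"), ("algo", "tue")]

print(join_size(takes, runs_on))          # tuples before projection
print(join_project_size(takes, runs_on))  # distinct (person, day) pairs
```

In this example the join has 5 tuples, but only 4 distinct (person, day) pairs remain after the projection, because the pair ("bob", "tue") is produced through both courses.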
We observe that this is different from the “composable hash functions” used by Ganguly et al. [11].
http://fimi.cs.helsinki.fi.
http://www.stat.fsu.edu/pub/diehard/.

References

[1] Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. ACM, New York (1999). SIGMOD Rec. 28(2), 275–286
[2] Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases (VLDB ’94), pp. 487–499. Morgan Kaufmann, San Mateo (1994)
[3] Alon, N., Spencer, J.: The Probabilistic Method. Wiley, New York (1992). ISBN 0-471-53588-5
[4] Amossen, R.R., Pagh, R.: Faster join-projects and sparse matrix multiplications. In: Proceedings of the 12th International Conference on Database Theory (ICDT ’09), pp. 121–126. ACM, New York (2009)
[5] Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM ’02), pp. 1–10. Springer, Berlin (2002)
[6] Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proceedings of the 19th ACM Symposium on Principles of Database Systems (PODS ’00), pp. 268–279. ACM, New York (2000)
[7] Cohen, E.: Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55(3), 441–453 (1997)
[8] Cohen, E.: Structure prediction and computation of sparse matrix products. J. Comb. Optim. 2(4), 307–332 (1998)
[9] Dor, D., Zwick, U.: Selecting the median. SIAM J. Comput. 28(5), 1722–1758 (1999)
[10] Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. J. Assoc. Comput. Mach. 31(3), 538–544 (1984)
[11] Ganguly, S., Garofalakis, M., Kumar, A., Rastogi, R.: Join-distinct aggregate estimation over update streams. In: Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS ’05), pp. 259–270. ACM, New York (2005)
[12] Ganguly, S., Saha, B.: On estimating path aggregates over streaming graphs. In: Proceedings of 17th International Symposium on Algorithms and Computation (ISAAC ’06). Lecture Notes in Computer Science, vol. 4288, pp. 163–172. Springer, Berlin (2006)
[13] Gibbons, P.B.: Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB ’01), pp. 541–550. Morgan Kaufmann, San Mateo (2001). http://www.vldb.org/conf/2001/P541.pdf
[14] Hoare, C.A.R.: Algorithm 65: find. Commun. ACM 4(7), 321–322 (1961)
[15] Lingas, A.: A fast output-sensitive algorithm for boolean matrix multiplication. Algorithmica 61(1), 36–50 (2011)
[16] Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995). ISBN 0-521-47465-5
[17] Yuster, R., Zwick, U.: Fast sparse matrix multiplication. ACM Trans. Algorithms 1(1), 2–13 (2005). doi:10.1145/1077464.1077466

Acknowledgements

We would like to thank Jelani Nelson for useful discussions, and in particular for introducing us to the idea of buffering to achieve faster data stream algorithms. Also, we thank Sumit Ganguly for clarifying the lower bound proof of [11] to us. Finally, we thank Konstantin Kutzkov and Rolf Fagerberg for pointing out mistakes that have been corrected in this version of the paper.

Author information

Authors and Affiliations

Slåenhøj 36, 2990, Nivå, Denmark
Rasmus Resen Amossen
IT University of Copenhagen, 2300, Copenhagen S, Denmark
Andrea Campagna & Rasmus Pagh

Corresponding author

Correspondence to Andrea Campagna.

Additional information

This work was supported by the Danish National Research Foundation, as part of the project “Scalable Query Evaluation in Relational Database Systems”.

About this article

Cite this article

Amossen, R.R., Campagna, A. & Pagh, R. Better Size Estimation for Sparse Matrix Products. Algorithmica 69, 741–757 (2014). https://doi.org/10.1007/s00453-012-9692-9

Received: 28 May 2011
Accepted: 20 September 2012
Published: 01 March 2013
Issue Date: July 2014
DOI: https://doi.org/10.1007/s00453-012-9692-9
