Coreset-Based Data Compression for Logistic Regression

Riquelme-Granada, Nery; Nguyen, Khuong An; Luo, Zhiyuan

doi:10.1007/978-3-030-83014-4_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1446)

Included in the following conference series:

International Conference on Data Management Technologies and Applications

Abstract

The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms used to construct them may become computationally expensive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and the Regressed Data Summarisation Framework (RDSF). The former is an acceleration procedure based on a simple theoretical observation about using Uniform Random Sampling for clustering problems; the latter is a coreset-based data-summarising framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating data summaries with RDSF is Ordinary Least Squares (OLS).
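
The abstract does not spell out the ACvS procedure, so the following is only a minimal sketch of the general idea it alludes to: running a clustering step on a uniform random subsample rather than on the full input. The function names, the choice of Lloyd's k-means, and all parameters are assumptions made for illustration and are not taken from the chapter.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm (an assumed stand-in clustering routine).

    Returns an array of k cluster centres.
    """
    rng = np.random.default_rng(seed)
    # Initialise centres with k distinct data points (fancy indexing copies).
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centres[j] = members.mean(axis=0)
    return centres


def cluster_on_uniform_sample(data, k, sample_size, seed=0):
    """Cluster a uniform random subsample instead of the full data (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    size = min(sample_size, len(data))
    idx = rng.choice(len(data), size=size, replace=False)
    return kmeans(data[idx], k, seed=seed)
```

Calling `cluster_on_uniform_sample(data, k=2, sample_size=200)` on a large array clusters only 200 rows, so the cost of the clustering step no longer grows with the size of the full input. How the resulting centres feed into the coreset construction is described in the chapter itself, not here.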

Notes

1. Instead of using the full input data \(\mathscr{D}\).
2. For our discussion, it is enough to state that the sensitivity score is a real value in the half-open interval \([0, \infty)\).
3. For CABLR, Huggins et al. proved that \(t(\epsilon, \delta) := \left\lceil \frac{c\,\bar{m}_N}{\epsilon^2}\left[(D + 1)\log \bar{m}_N + \log\frac{1}{\delta}\right] \right\rceil\), where \(D\) is the number of features in the input data, \(\bar{m}_N\) is the average sensitivity of the input data and \(c\) is a constant. The mentioned trade-off can be seen in this definition of \(t\).
4. We lose random access when the input data is so large that accessing some parts of the data is computationally more expensive than accessing other parts.
5. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ - last accessed 4/2021.
6. https://bitbucket.org/jhhuggins/lrcoresets/src/master - last accessed 2/2020.
7. We carefully distinguish between a coreset and a coreset-based summary. The former requires a theoretical proof of the quality loss.
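
The coreset-size bound quoted in note 3 can be evaluated directly to see how the size grows as \(\epsilon\) shrinks. The sketch below is illustrative only: the constant \(c\) is not given in the note, so `c=1.0` is a placeholder assumption, the logarithm is assumed to be natural, and the function and parameter names are hypothetical.

```python
import math


def coreset_size(eps, delta, mean_sensitivity, n_features, c=1.0):
    """Evaluate t(eps, delta) from note 3.

    c=1.0 is a placeholder; the note only says that c is a constant.
    The natural logarithm is assumed.
    """
    if eps <= 0 or not (0 < delta < 1) or mean_sensitivity <= 0:
        raise ValueError("need eps > 0, 0 < delta < 1 and mean_sensitivity > 0")
    bracket = (n_features + 1) * math.log(mean_sensitivity) + math.log(1.0 / delta)
    return math.ceil(c * mean_sensitivity / eps**2 * bracket)


# Halving eps roughly quadruples the bound, because of the 1 / eps^2 factor.
loose = coreset_size(eps=0.2, delta=0.01, mean_sensitivity=2.0, n_features=10)
tight = coreset_size(eps=0.1, delta=0.01, mean_sensitivity=2.0, n_features=10)
print(loose, tight)
```

Under these assumptions the bound depends on \(\epsilon\) only through the \(1/\epsilon^2\) factor, so a tighter approximation guarantee is paid for with a quadratically larger coreset.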

References

Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Alg. (JEA) 17, 2–4 (2012)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)
Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
Bachem, O., Lucic, M., Krause, A.: Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476 (2017)
Bădoiu, M., Clarkson, K.L.: Optimal core-sets for balls. Comput. Geom. 40(1), 14–22 (2008)
Braverman, V., Feldman, D., Lang, H.: New frameworks for offline and streaming coreset constructions. CoRR abs/1612.00889 (2016). http://arxiv.org/abs/1612.00889
Dasgupta, S., Gupta, A.: An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical report 22(1), 1–5 (1999)
Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240 (2006)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569–578. ACM (2011)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for K-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453. SIAM (2013)
Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_25
Har-Peled, S., Mazumdar, S.: On coresets for k-Means and k-Median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291–300. ACM (2004)
Huggins, J., Campbell, T., Broderick, T.: Coresets for scalable Bayesian logistic regression. In: Advances in Neural Information Processing Systems, pp. 4080–4088 (2016)
Mustafa, N.H., Varadarajan, K.R.: Epsilon-approximations and epsilon-nets. arXiv preprint arXiv:1702.03676 (2017)
Phillips, J.M.: Coresets and sketches. arXiv preprint arXiv:1601.00617 (2016)
Reddi, S.J., Póczos, B., Smola, A.J.: Communication efficient coresets for empirical loss minimization. In: UAI, pp. 752–761 (2015)
Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: On generating efficient data summaries for logistic regression: a coreset-based approach. In: Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 78–89. INSTICC. SciTePress (2020). https://doi.org/10.5220/0009823200780089
Riquelme-Granada, N., Nguyen, K., Luo, Z.: Coreset-based conformal prediction for large-scale learning. In: Conformal and Probabilistic Prediction and Applications, pp. 142–162 (2019)
Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: Fast probabilistic prediction for kernel SVM via enclosing balls. In: Conformal and Probabilistic Prediction and Applications, pp. 189–208. PMLR (2020)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York (2014)
Shalev-Shwartz, S., et al.: Online learning and online convex optimization. Found. Trends Machine Learn. 4(2), 107–194 (2012)
Zhang, Y., Tangwongsan, K., Tirthapura, S.: Streaming k-means clustering with fast queries. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 449–460. IEEE (2017)

Acknowledgements

This research is supported by AstraZeneca and the Paraguayan Government.

Author information

Authors and Affiliations

Royal Holloway University of London, Surrey, TW20 0EX, UK
Nery Riquelme-Granada & Zhiyuan Luo
University of Brighton, East Sussex, BN2 4AT, UK
Khuong An Nguyen

Corresponding author

Correspondence to Nery Riquelme-Granada.

Editor information

Editors and Affiliations

MODESTE/ESEO, Angers, France
Slimane Hammoudi
Fraunhofer FIT and RWTH Aachen University, Aachen, Germany
Christoph Quix
University of Coimbra, Coimbra, Portugal
Jorge Bernardino

About this paper

Cite this paper

Riquelme-Granada, N., Nguyen, K.A., Luo, Z. (2021). Coreset-Based Data Compression for Logistic Regression. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_10

DOI: https://doi.org/10.1007/978-3-030-83014-4_10
Published: 23 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83013-7
Online ISBN: 978-3-030-83014-4
eBook Packages: Computer Science, Computer Science (R0)
