Abstract
The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms that construct them can themselves become computationally expensive in some settings. We show that this happens easily when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and the Regressed Data Summarisation Framework (RDSF). The former is an acceleration procedure based on a simple theoretical observation about using Uniform Random Sampling for clustering problems; the latter is a coreset-based data-summarising framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating data summaries with RDSF is Ordinary Least Squares (OLS).
Notes
- 1. Instead of using the full input data \(\mathscr{D}\).
- 2. For our discussion, it is enough to state that the sensitivity score is a real value in the half-open interval \([0,\infty)\).
- 3. For CABLR, Huggins et al. proved that \(t(\epsilon,\delta) := \left\lceil \frac{c\,\bar{m}_N}{\epsilon^2}\left[(D + 1)\log \bar{m}_N + \log\left(\frac{1}{\delta}\right)\right] \right\rceil\), where \(D\) is the number of features in the input data, \(\bar{m}_N\) is the average sensitivity of the input data and \(c\) is a constant. The trade-off mentioned can be seen in the definition of \(t\).
- 4. We lose random access when the input data is so large that accessing some parts of the data is computationally more expensive than accessing other parts.
- 5. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ - last accessed 4/2021.
- 6. https://bitbucket.org/jhhuggins/lrcoresets/src/master - last accessed 2/2020.
- 7. We carefully distinguish between a coreset and a coreset-based summary. The former requires a theoretical proof on the quality loss.
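The coreset size bound quoted in note 3 can be evaluated numerically to see how it reacts to the error parameter and the failure probability. The following is a minimal sketch, not the authors' code: the function name `coreset_size` and its argument names are hypothetical, the constant `c` is left unspecified in the notes and defaults to 1.0 here purely for illustration, and the logarithm is assumed to be the natural logarithm.

```python
import math


def coreset_size(eps, delta, avg_sensitivity, num_features, c=1.0):
    """Evaluate t(eps, delta) from note 3.

    t = ceil( c * m / eps^2 * [ (D + 1) * log(m) + log(1 / delta) ] )

    where m is the average sensitivity and D the number of features.
    Assumptions (not stated in the source): natural logarithm, c = 1.0.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if avg_sensitivity <= 0:
        raise ValueError("avg_sensitivity must be positive for log to be defined")

    # Bracketed term: grows with the dimension D and with the confidence 1 - delta.
    bracket = (num_features + 1) * math.log(avg_sensitivity) + math.log(1.0 / delta)
    return math.ceil(c * avg_sensitivity / eps ** 2 * bracket)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper's experiments.
    for eps in (0.2, 0.1, 0.05):
        print(eps, coreset_size(eps, delta=0.01, avg_sensitivity=5.0, num_features=20))
```

Because \(\epsilon\) enters as \(1/\epsilon^2\), halving the error roughly quadruples the required size, while \(\delta\) only contributes logarithmically.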
References
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Alg. (JEA) 17, 2–4 (2012)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)
Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
Bachem, O., Lucic, M., Krause, A.: Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476 (2017)
Bădoiu, M., Clarkson, K.L.: Optimal core-sets for balls. Comput. Geom. 40(1), 14–22 (2008)
Braverman, V., Feldman, D., Lang, H.: New frameworks for offline and streaming coreset constructions. CoRR abs/1612.00889 (2016). http://arxiv.org/abs/1612.00889
Dasgupta, S., Gupta, A.: An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical report 22(1), 1–5 (1999)
Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240 (2006)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569–578. ACM (2011)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for K-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453. SIAM (2013)
Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_25
Har-Peled, S., Mazumdar, S.: On coresets for k-Means and k-Median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291–300. ACM (2004)
Huggins, J., Campbell, T., Broderick, T.: Coresets for scalable Bayesian logistic regression. In: Advances in Neural Information Processing Systems, pp. 4080–4088 (2016)
Mustafa, N.H., Varadarajan, K.R.: Epsilon-approximations and epsilon-nets. arXiv preprint arXiv:1702.03676 (2017)
Phillips, J.M.: Coresets and sketches. arXiv preprint arXiv:1601.00617 (2016)
Reddi, S.J., Póczos, B., Smola, A.J.: Communication efficient coresets for empirical loss minimization. In: UAI, pp. 752–761 (2015)
Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: On generating efficient data summaries for logistic regression: a coreset-based approach. In: Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 78–89. INSTICC. SciTePress (2020). https://doi.org/10.5220/0009823200780089
Riquelme-Granada, N., Nguyen, K., Luo, Z.: Coreset-based conformal prediction for large-scale learning. In: Conformal and Probabilistic Prediction and Applications, pp. 142–162 (2019)
Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: Fast probabilistic prediction for kernel SVM via enclosing balls. In: Conformal and Probabilistic Prediction and Applications, pp. 189–208. PMLR (2020)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York (2014)
Shalev-Shwartz, S., et al.: Online learning and online convex optimization. Found. Trends Machine Learn. 4(2), 107–194 (2012)
Zhang, Y., Tangwongsan, K., Tirthapura, S.: Streaming k-means clustering with fast queries. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 449–460. IEEE (2017)
Acknowledgements
This research is supported by AstraZeneca and the Paraguayan Government.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Riquelme-Granada, N., Nguyen, K.A., Luo, Z. (2021). Coreset-Based Data Compression for Logistic Regression. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_10
DOI: https://doi.org/10.1007/978-3-030-83014-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83013-7
Online ISBN: 978-3-030-83014-4
eBook Packages: Computer Science, Computer Science (R0)