Coreset-Based Data Compression for Logistic Regression

  • Conference paper
Data Management Technologies and Applications (DATA 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1446)

Abstract

The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used to accelerate many learning problems, the algorithms that construct them can themselves become computationally expensive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and the Regressed Data Summarisation Framework (RDSF). The former is an acceleration procedure based on a simple theoretical observation about using Uniform Random Sampling for clustering problems; the latter is a coreset-based data-summarisation framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets, and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating summaries of data under the RDSF framework is Ordinary Least Squares (OLS).
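The paper's own constructions are defined in the full text; purely as an illustration of the general recipe behind sensitivity-based coresets (cluster a cheap uniform subsample, score each point by its distance to the nearest centre, then importance-sample with reweighting), here is a minimal numpy sketch. The function names, the crude sensitivity surrogate, and all parameter values are our assumptions, not the authors' ACvS/RDSF algorithms:

```python
import numpy as np

def uniform_subsample(X, m, rng):
    # ACvS-style idea: cluster a uniform random subsample instead of the full data.
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx]

def kmeans(X, k, rng, iters=20):
    # Plain Lloyd's iterations; clustering here is only a preprocessing
    # step for estimating per-point sensitivities.
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                C[j] = X[a == j].mean(0)
    return C

def coreset(X, k, m_cluster, m_coreset, rng):
    # Importance sampling: points far from their nearest centre are more
    # "surprising" and receive a higher sampling probability.
    C = kmeans(uniform_subsample(X, m_cluster, rng), k, rng)
    d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)
    s = d + d.mean()                    # crude stand-in for a sensitivity bound
    p = s / s.sum()
    idx = rng.choice(len(X), size=m_coreset, replace=True, p=p)
    w = 1.0 / (m_coreset * p[idx])      # weights keep estimates unbiased
    return X[idx], w

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
Xc, w = coreset(X, k=4, m_cluster=500, m_coreset=100, rng=rng)
```

Learning a logistic regression classifier on the weighted pairs (Xc, w) then stands in for learning on the full X.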

Notes

  1.

    Instead of using the full input data \(\mathscr {D}\).

  2.

    For our discussion, it is enough to state that the sensitivity score is a real value in the half-open interval \([0,\infty )\).

  3.

    For CABLR, Huggins et al. proved that \(t(\epsilon ,\delta ) := \lceil \frac{c\,\bar{m}_N}{\epsilon ^2}[(D + 1)\log \bar{m}_N + \log (\frac{1}{\delta })] \rceil \), where D is the number of features in the input data, \(\bar{m}_N\) is the average sensitivity of the input data and c is a constant. The mentioned trade-off can be seen directly in the definition of t.
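    As a numerical illustration, the bound on t can be evaluated directly; the values of c, \(\bar{m}_N\), D, \(\epsilon \) and \(\delta \) below are illustrative assumptions, not figures from the paper:

```python
import math

def coreset_size(eps, delta, D, m_bar, c=1.0):
    # t(eps, delta) = ceil((c * m_bar / eps^2) * ((D + 1) * log(m_bar) + log(1/delta)))
    return math.ceil(c * m_bar / eps**2
                     * ((D + 1) * math.log(m_bar) + math.log(1.0 / delta)))

t_loose = coreset_size(eps=0.5, delta=0.05, D=10, m_bar=5.0)
t_tight = coreset_size(eps=0.1, delta=0.05, D=10, m_bar=5.0)
```

    The \(\epsilon ^{-2}\) factor makes the accuracy/size trade-off explicit: tighter approximation guarantees force larger coresets.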

  4.

    We lose random access when the input data is so large that accessing some parts of the data is computationally more expensive than accessing other parts.

  5.

    https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ - last accessed in 4/2021.

  6.

    https://bitbucket.org/jhhuggins/lrcoresets/src/master - last accessed in 2/2020.

  7.

    We carefully distinguish between a coreset and a coreset-based summary: the former requires a theoretical proof bounding the quality loss.

References

  1. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Alg. (JEA) 17, 2–4 (2012)

  2. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)

  3. Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)

  4. Bachem, O., Lucic, M., Krause, A.: Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476 (2017)

  5. Bădoiu, M., Clarkson, K.L.: Optimal core-sets for balls. Comput. Geom. 40(1), 14–22 (2008)

  6. Braverman, V., Feldman, D., Lang, H.: New frameworks for offline and streaming coreset constructions. CoRR abs/1612.00889 (2016). http://arxiv.org/abs/1612.00889

  7. Dasgupta, S., Gupta, A.: An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical report 22(1), 1–5 (1999)

  8. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240 (2006)

  9. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569–578. ACM (2011)

  10. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for K-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453. SIAM (2013)

  11. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_25

  12. Har-Peled, S., Mazumdar, S.: On coresets for k-Means and k-Median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291–300. ACM (2004)

  13. Huggins, J., Campbell, T., Broderick, T.: Coresets for scalable Bayesian logistic regression. In: Advances in Neural Information Processing Systems, pp. 4080–4088 (2016)

  14. Mustafa, N.H., Varadarajan, K.R.: Epsilon-approximations and epsilon-nets. arXiv preprint arXiv:1702.03676 (2017)

  15. Phillips, J.M.: Coresets and sketches. arXiv preprint arXiv:1601.00617 (2016)

  16. Reddi, S.J., Póczos, B., Smola, A.J.: Communication efficient coresets for empirical loss minimization. In: UAI, pp. 752–761 (2015)

  17. Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: On generating efficient data summaries for logistic regression: a coreset-based approach. In: Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 78–89. INSTICC. SciTePress (2020). https://doi.org/10.5220/0009823200780089

  18. Riquelme-Granada, N., Nguyen, K., Luo, Z.: Coreset-based conformal prediction for large-scale learning. In: Conformal and Probabilistic Prediction and Applications, pp. 142–162 (2019)

  19. Riquelme-Granada, N., Nguyen, K.A., Luo, Z.: Fast probabilistic prediction for kernel SVM via enclosing balls. In: Conformal and Probabilistic Prediction and Applications, pp. 189–208. PMLR (2020)

  20. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York (2014)

  21. Shalev-Shwartz, S., et al.: Online learning and online convex optimization. Found. Trends Machine Learn. 4(2), 107–194 (2012)

  22. Zhang, Y., Tangwongsan, K., Tirthapura, S.: Streaming k-means clustering with fast queries. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 449–460. IEEE (2017)

Acknowledgements

This research is supported by AstraZeneca and the Paraguayan Government.

Author information

Corresponding author

Correspondence to Nery Riquelme-Granada.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Riquelme-Granada, N., Nguyen, K.A., Luo, Z. (2021). Coreset-Based Data Compression for Logistic Regression. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_10

  • DOI: https://doi.org/10.1007/978-3-030-83014-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83013-7

  • Online ISBN: 978-3-030-83014-4

  • eBook Packages: Computer Science (R0)
