Skip to main content

Faster Coreset Construction for Projective Clustering via Low-Rank Approximation

  • 520 Accesses

Part of the Lecture Notes in Computer Science book series (LNTCS,volume 10979)

Abstract

In this work, we present a randomized coreset construction for projective clustering, which involves computing a set of k closest j-dimensional linear (affine) subspaces of a given set of n vectors in d dimensions. Let \(A \in \mathbb {R}^{n\times d}\) be an input matrix. An earlier deterministic coreset construction of Feldman et. al. [10] relied on computing the SVD of A. The best known algorithms for SVD require \(\min \{nd^2, n^2d\}\) time, which may not be feasible for large values of n and d. We present a coreset construction by projecting the matrix A on some orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we are able to achieve a faster algorithm, as compared to [10], while maintaining almost the same approximation. We also benefit in terms of space as well as exploit the sparsity of the input dataset. Another advantage of our approach is that it can be constructed in a streaming setting quite efficiently.

Keywords

  • Coreset Construction
  • Projective Clustering
  • Singular Vectors
  • Dimension Reduction Step
  • Linear Algebra Background

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

R. Pratap—This work done when author was affiliated with TCS Innovation Labs.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-94667-2_28
  • Chapter length: 13 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   59.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-94667-2
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   79.99
Price excludes VAT (USA)

Notes

  1. 1.

    Here, \(\tilde{O}\) is the asymptotic notation that ignores logarithmic factors.

  2. 2.

    Two passes are required as we first multiply A on the right with a Johnson-Lindenstrauss matrix \(\mathcal {S}\), and then we project the rows of A again onto the row span of \(\mathcal {S}A\).

References

  1. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)

    MathSciNet  CrossRef  Google Scholar 

  2. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. In: Welzl, E., (ed.) Current Trends in Combinatorial and Computational Geometry (2007)

    Google Scholar 

  3. Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, 4–6 January 2009, pp. 968–977 (2009)

    CrossRef  Google Scholar 

  4. Clarkson, K.L., Woodruff, D.P.: Low rank approximation and regression in input sparsity time. In: Symposium on Theory of Computing Conference, STOC 2013, Palo Alto, CA, USA, 1–4 June 2013, pp. 81–90 (2013)

    Google Scholar 

  5. Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, 14–17 June 2015, pp. 163–172 (2015)

    Google Scholar 

  6. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA, pp. 226–231 (1996)

    Google Scholar 

  7. Feldman, D., Fiat, A., Sharir, M.: Coresets forweighted facilities and their applications. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 315–324 (2006)

    Google Scholar 

  8. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 569–578 (2011)

    Google Scholar 

  9. Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010)

    CrossRef  Google Scholar 

  10. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453 (2013)

    CrossRef  Google Scholar 

  11. Har-Peled, S.: No, coreset, no cry. In: Lodaya, K., Mahajan, M. (eds.) FSTTCS 2004: Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 3328, pp. 324–335. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30538-5_27

    CrossRef  Google Scholar 

  12. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004, pp. 291–300 (2004)

    Google Scholar 

  13. Henzinger, M.R., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: Proceedings of a DIMACS Workshop on External Memory Algorithms, New Brunswick, New Jersey, USA, 20–22 May 1998, pp. 107–118 (1998)

    Google Scholar 

  14. Hinneburg, A., Aggarwal, C.C., Keim, D.A.: What is the nearest neighbor in high dimensional spaces? In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, 10–14 September 2000, Cairo, Egypt, pp. 506–515 (2000)

    Google Scholar 

  15. Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)

    MATH  Google Scholar 

  16. Phillips, J.M.: Coresets and sketches. CoRR, abs/1601.00617 (2016)

    Google Scholar 

  17. Sarlós, T.: Improved approximation algorithms for large matrices via random projections. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 143–152 (2006)

    Google Scholar 

  18. Varadarajan, K.R., Xiao, X.: A near-linear algorithm for projective clustering integer points. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, 17–19 January 2012, pp. 1329–1342 (2012)

    CrossRef  Google Scholar 

  19. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)

    CrossRef  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rameshwar Pratap .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Pratap, R., Sen, S. (2018). Faster Coreset Construction for Projective Clustering via Low-Rank Approximation. In: Iliopoulos, C., Leong, H., Sung, WK. (eds) Combinatorial Algorithms. IWOCA 2018. Lecture Notes in Computer Science(), vol 10979. Springer, Cham. https://doi.org/10.1007/978-3-319-94667-2_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-94667-2_28

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94666-5

  • Online ISBN: 978-3-319-94667-2

  • eBook Packages: Computer ScienceComputer Science (R0)