Small Space Representations for Metric Min-Sum k-Clustering and Their Applications

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4393)

Abstract

The min-sum k-clustering problem is to partition a metric space (P,d) into k clusters \(C_1, \ldots, C_k \subseteq P\) such that \(\sum_{i=1}^k \sum_{p,q \in C_i} d(p,q)\) is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. Using our coresets, we obtain three main algorithmic results.
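To make the objective concrete, the following minimal Python sketch evaluates the min-sum cost of a given partition and finds an optimal partition of a tiny instance by exhaustive search. It is an illustration only, not the coreset construction of the paper. The function names, the example points, and the choice to count each unordered pair {p,q} once (which changes the cost only by a factor of 2 relative to summing over ordered pairs) are assumptions of this sketch.

```python
from itertools import combinations, product


def min_sum_cost(clusters, d):
    """Sum of intra-cluster pairwise distances (each unordered pair counted once)."""
    return sum(d(p, q) for c in clusters for p, q in combinations(c, 2))


def brute_force_min_sum(points, k, d):
    """Try every assignment of points to k labels; exponential, tiny inputs only."""
    best_cost, best_clusters = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        clusters = [[p for p, lab in zip(points, labels) if lab == i]
                    for i in range(k)]
        cost = min_sum_cost(clusters, d)
        if cost < best_cost:
            best_cost, best_clusters = cost, clusters
    return best_cost, best_clusters


# Hypothetical example: points on a line with d(p, q) = |p - q|.
points = [0, 1, 2, 10, 11]
d = lambda p, q: abs(p - q)
cost, clusters = brute_force_min_sum(points, 2, d)
print(cost, clusters)  # 5 [[0, 1, 2], [10, 11]]
```

The exhaustive search takes time exponential in the number of points, which is why approximation algorithms and small-space representations such as coresets are of interest for this problem.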

The first result is a sublinear-time (4 + ε)-approximation algorithm for the min-sum k-clustering problem in metric spaces. The running time of this algorithm is \(\widetilde{O}(n)\) for any constant k and ε, and it is \(o(n^2)\) for all \(k = o(\log n/\log\log n)\). Since the description size of the input is \(\Theta(n^2)\), this is sublinear in the input size.

Our second result is the first pass-efficient data streaming algorithm for min-sum k-clustering in the distance oracle model, i.e., an algorithm that uses \({\mathit{poly}}(\log n, k)\) space and makes 2 passes over the input point set arriving as a data stream.

Our third result is a sublinear-time polylogarithmic-factor approximation algorithm for the min-sum k-clustering problem for arbitrary values of k.

To develop the coresets, we introduce the concept of α-preserving metric embeddings. Such an embedding of (P,d) into (P,d′) satisfies the properties that (a) the distance between any pair of points does not decrease, and (b) the cost of an optimal solution for the considered problem on input (P,d′) is within a constant factor of the optimal solution on input (P,d). In other words, the idea is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of α with respect to a certain problem. We believe that this concept is an interesting generalization of coresets.
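The two properties can be checked by brute force on a tiny instance, as in the Python sketch below. It is an illustration under stated assumptions, not the embedding used in the paper: the helper names are hypothetical, property (b) is read here as OPT(P,d′) ≤ α · OPT(P,d), and the example metric d′ simply rounds every distance up to the next integer (which keeps the triangle inequality and never shrinks a distance).

```python
import math
from itertools import combinations, product


def opt_cost(points, k, d):
    """Optimal min-sum k-clustering cost by exhaustive search (tiny inputs only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        # Sum d over all unordered pairs that share a cluster label.
        cost = sum(d(p, q)
                   for (p, lp), (q, lq) in combinations(zip(points, labels), 2)
                   if lp == lq)
        best = min(best, cost)
    return best


def is_alpha_preserving(points, k, d, d_new, alpha):
    """Check (a) no distance decreases and (b) OPT under d_new <= alpha * OPT under d."""
    non_contracting = all(d_new(p, q) >= d(p, q)
                          for p, q in combinations(points, 2))
    cost_bounded = opt_cost(points, k, d_new) <= alpha * opt_cost(points, k, d)
    return non_contracting and cost_bounded


# Hypothetical example: line metric d and its integer round-up d_new.
points = [0.0, 0.5, 1.2, 5.0, 5.3]
d = lambda p, q: abs(p - q)
d_new = lambda p, q: math.ceil(abs(p - q))

print(is_alpha_preserving(points, 2, d, d_new, alpha=2.0))  # True
print(is_alpha_preserving(points, 2, d, d_new, alpha=1.5))  # False
```

Because distances never decrease under property (a), the optimal cost under d′ is automatically at least the optimal cost under d, so only the upper bound needs to be tested.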

Research supported in part by NSF ITR grant CCR-0313219, by EPSRC grant EP/D063191/1, and by DFG grant Me 872/8-3.

Editor information

Wolfgang Thomas, Pascal Weil

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Czumaj, A., Sohler, C. (2007). Small Space Representations for Metric Min-Sum k-Clustering and Their Applications. In: Thomas, W., Weil, P. (eds) STACS 2007. Lecture Notes in Computer Science, vol 4393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70918-3_46

  • DOI: https://doi.org/10.1007/978-3-540-70918-3_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70917-6

  • Online ISBN: 978-3-540-70918-3

  • eBook Packages: Computer Science, Computer Science (R0)
