Techniques for Clustering Massive Data Sets

  • Sudipto Guha
  • Rajeev Rastogi
  • Kyuseok Shim
Part of the Network Theory and Applications book series (NETA, volume 11)

Abstract

The wealth of information embedded in huge databases belonging to corporations (e.g., retail, financial, telecom) has spurred tremendous interest in knowledge discovery and data mining. In data mining, clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data. The clustering problem can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters.
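
To make the problem statement concrete, here is a minimal sketch that partitions n points in d-dimensional Euclidean space into k clusters. It uses the farthest-first traversal heuristic for the k-center objective, chosen here purely as an illustration; the abstract does not say which algorithms the chapter covers. The function names `farthest_first_centers` and `assign`, and the sample points, are hypothetical.

```python
import math


def farthest_first_centers(points, k):
    """Greedily pick k centers: start from the first point, then repeatedly
    add the point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    # nearest[i] is the distance from points[i] to its closest chosen center.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


# Illustrative data: n = 6 points, d = 2 dimensions, k = 3 clusters.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = farthest_first_centers(points, 3)
labels = assign(points, centers)
```

Points that share a label form one cluster. Other objectives, such as k-median, would change how centers are chosen but not the shape of the output, which is always a partition of the n points into k groups.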

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Sudipto Guha (1)
  • Rajeev Rastogi (2)
  • Kyuseok Shim (3)
  1. Department of Computer Information Sciences, University of Pennsylvania, Philadelphia, USA
  2. Bell Laboratories, Lucent Technologies, Murray Hill, USA
  3. School of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea
