Clustering and Information Retrieval, pp 35-82

# Techniques for Clustering Massive Data Sets

Chapter

## Abstract

The wealth of information embedded in huge databases belonging to corporations (e.g., retail, financial, telecom) has spurred tremendous interest in the areas of *knowledge discovery* and *data mining*. In data mining, clustering is a useful technique for discovering interesting distributions and patterns in the underlying data. The clustering problem can be defined as follows: given *n* data points in a *d*-dimensional metric space, partition the data points into *k* clusters such that the data points within a cluster are more similar to each other than to data points in different clusters.
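As a concrete illustration of the problem statement, the following minimal sketch partitions *n* points into *k* clusters. It makes several assumptions that are not stated in the abstract: Euclidean distance as the metric, farthest-first traversal (a classical greedy heuristic for the k-center objective) to choose the centers, and nearest-center assignment to form the partition. The function names and sample points are hypothetical, and this should not be read as the chapter's own method.

```python
import math

def farthest_first_centers(points, k):
    """Greedily pick k centers (assumes 1 <= k <= len(points)).

    Each new center is the point farthest from all centers chosen so far.
    """
    centers = [points[0]]  # arbitrary starting center
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers

def assign_clusters(points, centers):
    """Partition the points by nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[j].append(p)
    return clusters

# Hypothetical sample data: n = 6 points in d = 2 dimensions, k = 3.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = farthest_first_centers(points, 3)
clusters = assign_clusters(points, centers)
```

On this sample, each well-separated pair of points ends up in its own cluster. Other objectives (k-median, min-sum) or other distance functions would give different partitions; the choice here is only for illustration.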

## Keywords

Cluster Algorithm; Massive Data; Facility Location Problem; Categorical Attribute; Representative Point



## Copyright information

© Kluwer Academic Publishers 2004