Parallel Algorithms for Clustering High-Dimensional Large-Scale Datasets

Nagesh, Harsha; Goil, Sanjay; Choudhary, Alok

doi:10.1007/978-1-4615-1733-7_19

Part of the book series: Massive Computing (MACO, volume 2)

Abstract

Clustering techniques for large-scale and high-dimensional data sets have attracted great interest in recent literature. Such data sets arise in both scientific and commercial applications. Clustering is the process of identifying dense regions in a sparse multi-dimensional data set. Several clustering techniques proposed earlier scale poorly either to a very large number of dimensions or to a large data set. Many of them require key user inputs, which makes them hard to apply to real-world data sets, or fail to represent the generated clusters in an intuitive way. We have designed and implemented pMAFIA, a density- and grid-based clustering algorithm in which a multi-dimensional space is divided into finer grids and the dense regions found are merged to identify the clusters. For large data sets with a large number of dimensions, fine division of the multi-dimensional space leads to an enormous amount of computation. We have introduced an adaptive grid framework that not only greatly reduces the computation by forming grids based on the data distribution, but also improves the quality of clustering. Clustering algorithms also need to explore clusters in a subspace of the total data space. We have implemented a new bottom-up algorithm that explores all possible subspaces to identify the embedded clusters. Further, our framework requires no user input, making pMAFIA a completely unsupervised data mining algorithm. Finally, we have also introduced parallelism in the clustering process, which enables our data mining tool to scale up to massive data sets and a large number of dimensions. Data parallelism coupled with task parallelism has been shown to yield the best parallelization results on a diverse set of synthetic and real data sets.
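
The idea of forming grids from the data distribution can be illustrated in one dimension. The following is a minimal sketch under stated assumptions, not the authors' pMAFIA implementation: the function names `adaptive_bins` and `dense_windows`, the fixed fine-histogram resolution, and the parameters `merge_tol` and `factor` are all hypothetical choices made for illustration (the abstract states that pMAFIA itself requires no user input). The sketch builds a fine histogram, merges adjacent bins with similar counts into variable-width windows, and then flags windows holding noticeably more points than a uniform distribution would place there.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float, int]  # (start, end, point count)


def adaptive_bins(values: Sequence[float], lo: float, hi: float,
                  fine_bins: int = 20, merge_tol: float = 0.5) -> List[Window]:
    """Merge adjacent fine histogram bins with similar counts.

    Assumes every value lies in [lo, hi]. `fine_bins` and `merge_tol`
    are illustrative parameters, not taken from the chapter.
    """
    width = (hi - lo) / fine_bins
    counts = [0] * fine_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), fine_bins - 1)
        counts[idx] += 1

    # Each entry: [start, end, total count, number of fine bins merged].
    merged = [[lo, lo + width, counts[0], 1]]
    for i in range(1, fine_bins):
        start, _, total, n = merged[-1]
        mean = total / n
        c = counts[i]
        if abs(c - mean) <= merge_tol * max(mean, c, 1):
            # Similar density: extend the current window.
            merged[-1] = [start, lo + (i + 1) * width, total + c, n + 1]
        else:
            # Density changed sharply: start a new window.
            merged.append([lo + i * width, lo + (i + 1) * width, c, 1])
    return [(s, e, t) for s, e, t, _ in merged]


def dense_windows(windows: Sequence[Window], n_points: int, lo: float,
                  hi: float, factor: float = 1.5) -> List[Window]:
    """Keep windows whose count exceeds `factor` times the uniform expectation."""
    dense = []
    for s, e, t in windows:
        expected = n_points * (e - s) / (hi - lo)
        if t > factor * expected:
            dense.append((s, e, t))
    return dense
```

With this kind of grid, a long stretch of uniformly sparse data collapses into a single wide window, so later stages that combine windows across dimensions have far fewer candidates to examine than with a uniform fine grid.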

Editor information

Robert L. Grossman, University of Illinois, Chicago, USA
Chandrika Kamath, Lawrence Livermore National Laboratory, Livermore, USA
Philip Kegelmeyer, Sandia National Laboratories, Livermore, USA
Vipin Kumar, Army High Performance Computing Research Center (AHPCRC), Minneapolis, USA
Raju R. Namburu, Army Research Laboratory, Aberdeen Proving Ground, USA

About this chapter

Cite this chapter

Nagesh, H., Goil, S., Choudhary, A. (2001). Parallel Algorithms for Clustering High-Dimensional Large-Scale Datasets. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_19

DOI: https://doi.org/10.1007/978-1-4615-1733-7_19
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-0114-7
Online ISBN: 978-1-4615-1733-7
eBook Packages: Springer Book Archive
