Data Mining and Knowledge Discovery, Volume 3, Issue 3, pp 263–290

A Fast Parallel Clustering Algorithm for Large Spatial Databases

  • Xiaowei Xu
  • Jochen Jäger
  • Hans-Peter Kriegel
Article

Abstract

The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.
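The density-based notion of clusters that the abstract refers to can be illustrated with a minimal sequential sketch. This is not the PDBSCAN implementation described in the paper. It is a single-machine illustration that uses a brute-force neighborhood scan where the paper uses a spatial index, and it leaves out the data distribution across computers entirely. The names `eps`, `min_pts`, `region_query`, and `dbscan` are assumptions chosen for this example.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i].

    Brute-force scan (an assumption of this sketch); a spatial index
    would normally answer this query much faster on large databases.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Assign a cluster id (0, 1, ...) or NOISE to every point."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not dense enough to start a cluster; may become a border point later.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels
```

Calling `dbscan(points, eps, min_pts)` on a list of coordinate tuples returns one label per point. Because clusters grow by chaining dense neighborhoods, they can take arbitrary shapes, and isolated points keep the `NOISE` label.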

Keywords: clustering algorithms, parallel algorithms, distributed algorithms, scalable data mining, distributed index structures, spatial databases

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Xiaowei Xu (1)
  • Jochen Jäger (2)
  • Hans-Peter Kriegel (2)
  1. Corporate Technology, München, Germany
  2. Institute for Computer Science, University of Munich, München, Germany
