A Fast Parallel Clustering Algorithm for Large Spatial Databases
- 542 Downloads
- 89 Citations
Abstract
The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.
Preview
Unable to display preview. Download preview PDF.
References
- Agrawal, R. and Shafer, J.C. 1996. Parallel mining of association rules: design, implementation, and experience. IBM Research Report.Google Scholar
- Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data. Atlantic City, NJ, pp. 322–331.Google Scholar
- Berchtold S., Keim D.A., and Kriegel, H.-P. 1996. The X-tree: an index structure for high-dimensional data. Proc. 22nd Int. Conf. on Very Large Data Bases, Bombay, India, Morgan Kaufmann, pp. 28–39.Google Scholar
- Bially, T. 1969. Space-filling curves: their generation and their application to bandwidth reduction. IEEE Trans. on Information Theory, IT-15(6):658–664.CrossRefGoogle Scholar
- Cheung, D.W., Han, J., Ng, V.T., Fu, A.W., and Fu, Y. 1996. A fast distributed algorithm for mining association rules. Proc. Int. Conf. on Parallel and Distributed Information System (PDIS'96). Miami Beach, FL, USA.Google Scholar
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, pp. 226–231.Google Scholar
- Ester, M., Kriegel, H.-P., and Xu, X. 1995. A database interface for clustering in large spatial databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining. Montreal, Canada, 1995, pp. 94–99.Google Scholar
- Faloutsos, C. and Roseman, S. 1989. Fractals for secondary key retrieval. Proc. 8th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pp. 247–252.Google Scholar
- Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. Knowledge discovery and data mining: towards a unifying framework. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. Portland, OR, pp. 82–88.Google Scholar
- Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V. 1996. PVM: Parallel Virtual Machine—A User's Guide and Tutorial for Networked Parallel Computing. Cambridge, MA, London, England: The MIT Press, 3rd printing.Google Scholar
- Gueting, R.H. 1994. An introduction to spatial database systems. The VLDB Journal, 3(4):357–399.CrossRefGoogle Scholar
- Jaja, J. 1992. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, pp. 61–65.Google Scholar
- Kamel, I. and Faloutsos, C. 1993. On packing R-trees. Proc. 2nd Int. Conf. on Information and Knowledge Management (CIKM).Google Scholar
- Li, X. and Fang, Z. 1989. Parallel clustering algorithms. Parallel Computing, 11:275–290.CrossRefMathSciNetzbMATHGoogle Scholar
- Matheus, C.J., Chan, P.K., and Piatetsky-Shapiro, G. 1993. Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6):903–913.CrossRefGoogle Scholar
- Mehta, M. and DeWitt, D.J. 1997. Data placement in shared-nothing parallel database systems. VLDB Journal, 6:53–72.CrossRefGoogle Scholar
- Olson, C.F. 1995. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325.CrossRefzbMATHMathSciNetGoogle Scholar
- Park, J.-S., Chen, M.-S., and Yu, P.S. 1995. An effective hash based algorithm for mining association rules. Proc. ACM SIGMOD Int. Conf. on Management of Data. San Jose, CA, pp.175-186.Google Scholar
- Pfitzner, D.W., Salmon, J.K., and Sterling, T. 1998. Halo World: Tools for Parallel Cluster Finding in Astrophysical N-body Simulations. Data Mining and Knowledge Discovery. Kluwer Academic Publishers, Vol. 2,No. 2, pp. 419–438.Google Scholar
- Rasmussen, E.M. and Willett, P. 1989. Efficiency of hierarchical agglomerative clustering using the ICL distributed array processor. Journal of Documentation, 45(1):1–24.CrossRefGoogle Scholar
- Richards, A.J. 1983. Remote Sensing Digital Image Analysis. An Introduction, Berlin: Springer.Google Scholar
- Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. 1998. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery, Kluwer Academic Publishers, vol. 2, pp. 1–27.Google Scholar
- Stonebraker, M. 1986. The case for shared nothing. Database Eng., 9(1).Google Scholar
- Stonebraker, M., Frew, J., Gardels, K., and Meredith, J. 1993. The SEQUOIA 2000 storage benchmark. Proc. ACM SIGMOD Int. Conf. on Management of Data. Washington, DC, pp. 2–11.Google Scholar
- Xu, X. 1999. Efficient Clustering for Knowledge Discovery in Spatial Databases. Shaker, Germany: Aachen.Google Scholar
- Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. 1998. A distribution-based clustering algorithm for mining in large spatial databases. 14th Int. Conf. on Data Engineering (ICDE'98). Orlando, FL.Google Scholar
- Zhang, T., Ramakrishnan, R., and Livny, M. 1998. BIRCH: A New Data Clustering Algorithm and Its Applications, Kluwer Academic Publishers, pp. 1–40.Google Scholar