Advertisement

Contraction Clustering (Raster)

A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10710)

Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Open image in new window (Raster) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, Raster is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 s—when using a slow interpreted programming language like Python. Furthermore, Raster is easily parallelizable.

Keywords

Algorithms Big data Machine learning Unsupervised learning Clustering 

Notes

Acknowledgments

This research was supported by the project Fleet telematics big data analytics for vehicle usage modeling and analysis (FUMA) in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-02207), which is administered by VINNOVA, the Swedish Government Agency for Innovation Systems.

References

  1. 1.
    Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: International Conference on Management of Data, vol. 27, pp. 94–105. ACM (1998)Google Scholar
  2. 2.
    Bachem, O., Lucic, M., Hassani, H., Krause, A.: Fast and provably good seedings for k-means. In: Advances in Neural Information Processing Systems, pp. 55–63 (2016)Google Scholar
  3. 3.
    Baker, D.M., Valleron, A.J.: An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE). Int. J. Health Geogr. 13(1), 46 (2014)CrossRefGoogle Scholar
  4. 4.
    Capó, M., Pérez, A., Lozano, J.A.: An efficient approximation to the k-means clustering for massive data. Knowl. Based Syst. 117, 56–69 (2017)CrossRefGoogle Scholar
  5. 5.
    Danker, A.J., Rosenfeld, A.: Blob detection by relaxation. IEEE Trans. Pattern Anal. Mach. Intell. 1, 79–92 (1981)CrossRefGoogle Scholar
  6. 6.
    Darong, H., Peng, W.: Grid-based DBSCAN algorithm with referential parameters. Phys. Procedia 24, 1166–1170 (2012)CrossRefGoogle Scholar
  7. 7.
    van Diggelen, F., Enge, P.: The world’s first GPS MOOC and worldwide laboratory using smartphones. In: Proceedings of the 28th International Technical Meeting of The Satellite Division of the Institute of Navigation (ION GNSS+2015), pp. 361–369. ION (2015)Google Scholar
  8. 8.
    Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)Google Scholar
  9. 9.
    Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S., Bouras, A.: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2(3), 267–279 (2014)CrossRefGoogle Scholar
  10. 10.
    Hathaway, R.J., Bezdek, J.C.: Extending fuzzy and probabilistic clustering to very large data sets. Comput. Stat. Data Anal. 51(1), 215–234 (2006)MathSciNetCrossRefMATHGoogle Scholar
  11. 11.
    Inaba, M., Katoh, N., Imai, H.: Applications of weighted voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG 1994, pp. 332–339. ACM, New York (1994)Google Scholar
  12. 12.
    Kumar, A., Sabharwal, Y., Sen, S.: Linear time algorithms for clustering problems in any dimensions. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 1374–1385. Springer, Heidelberg (2005).  https://doi.org/10.1007/11523468_111 CrossRefGoogle Scholar
  13. 13.
    Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM (JACM) 57(2), 5 (2010)MathSciNetCrossRefMATHGoogle Scholar
  14. 14.
    Liao, W.k., Liu, Y., Choudhary, A.: A grid-based clustering algorithm using adaptive mesh refinement. In: 7th Workshop on Mining Scientific and Engineering Datasets of SIAM International Conference on Data Mining, pp. 61–69 (2004)Google Scholar
  15. 15.
    Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)MathSciNetCrossRefMATHGoogle Scholar
  16. 16.
    MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)Google Scholar
  17. 17.
    Pham, D.L.: Spatial models for fuzzy clustering. Comput. Vis. Image Underst. 84(2), 285–297 (2001)CrossRefMATHGoogle Scholar
  18. 18.
    Sheikholeslami, G., Chatterjee, S., Zhang, A.: Wavecluster: a multi-resolution clustering approach for very large spatial databases. In: VLDB, vol. 98, pp. 428–439 (1998)Google Scholar
  19. 19.
    Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T.: Big data clustering: a review. In: Murgante, B., et al. (eds.) ICCSA 2014. LNCS, vol. 8583, pp. 707–720. Springer, Cham (2014).  https://doi.org/10.1007/978-3-319-09156-3_49 Google Scholar
  20. 20.
    Wang, W., Yang, J., Muntz, R., et al.: Sting: a statistical information grid approach to spatial data mining. In: VLDB, vol. 97, pp. 186–195 (1997)Google Scholar
  21. 21.
    Xiaoyun, C., Yufang, M., Yan, Z., Ping, W.: GMDBSCAN: multi-density DBSCAN cluster based on grid. In: IEEE International Conference on e-Business Engineering, ICEBE 2008, pp. 780–783. IEEE (2008)Google Scholar
  22. 22.
    Yang, M.S.: A survey of fuzzy clustering. Math. Comput. Modell. 18(11), 1–16 (1993)MathSciNetCrossRefMATHGoogle Scholar

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. 1.Fraunhofer-Chalmers Research Centre for Industrial MathematicsGothenburgSweden

Personalised recommendations