Skip to main content

Contraction Clustering (RASTER)

A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

  • Conference paper
  • First Online:
Book cover Machine Learning, Optimization, and Big Data (MOD 2017)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10710))

Included in the following conference series:

Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. (RASTER) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, RASTER is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 s—when using a slow interpreted programming language like Python. Furthermore, RASTER is easily parallelizable.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The chosen shorthand may not be immediately obvious: RASTER operates on an implied grid. Resulting clusters can be made to look similar to the dot matrix structure of a raster graphics image. Furthermore, the name RASTER is an agglomerated contraction of the words and

  2. 2.

    On a contemporary workstation with 16 GB RAM, the scikit-learn implementation of DBSCAN cannot even handle one million data points.

  3. 3.

    We have identified tens of thousands of clusters with RASTER in a huge real-world data set, which shows that k-means clustering would have been highly unsuitable.

  4. 4.

    Reference implementations in several programming languages are available at https://gitlab.com/fraunhofer_chalmers_centre/contraction_clustering_raster.

  5. 5.

    In case it is not obvious why this is in linear time: For each row in a grid, take the current and next row into account. Start, for instance, with the tile in the top left corner and take its right neighbor as well as the two tiles adjacent in the next row into account.

References

  1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: International Conference on Management of Data, vol. 27, pp. 94–105. ACM (1998)

    Google Scholar 

  2. Bachem, O., Lucic, M., Hassani, H., Krause, A.: Fast and provably good seedings for k-means. In: Advances in Neural Information Processing Systems, pp. 55–63 (2016)

    Google Scholar 

  3. Baker, D.M., Valleron, A.J.: An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE). Int. J. Health Geogr. 13(1), 46 (2014)

    Article  Google Scholar 

  4. Capó, M., Pérez, A., Lozano, J.A.: An efficient approximation to the k-means clustering for massive data. Knowl. Based Syst. 117, 56–69 (2017)

    Article  Google Scholar 

  5. Danker, A.J., Rosenfeld, A.: Blob detection by relaxation. IEEE Trans. Pattern Anal. Mach. Intell. 1, 79–92 (1981)

    Article  Google Scholar 

  6. Darong, H., Peng, W.: Grid-based DBSCAN algorithm with referential parameters. Phys. Procedia 24, 1166–1170 (2012)

    Article  Google Scholar 

  7. van Diggelen, F., Enge, P.: The world’s first GPS MOOC and worldwide laboratory using smartphones. In: Proceedings of the 28th International Technical Meeting of The Satellite Division of the Institute of Navigation (ION GNSS+2015), pp. 361–369. ION (2015)

    Google Scholar 

  8. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)

    Google Scholar 

  9. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S., Bouras, A.: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2(3), 267–279 (2014)

    Article  Google Scholar 

  10. Hathaway, R.J., Bezdek, J.C.: Extending fuzzy and probabilistic clustering to very large data sets. Comput. Stat. Data Anal. 51(1), 215–234 (2006)

    Article  MathSciNet  Google Scholar 

  11. Inaba, M., Katoh, N., Imai, H.: Applications of weighted voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG 1994, pp. 332–339. ACM, New York (1994)

    Google Scholar 

  12. Kumar, A., Sabharwal, Y., Sen, S.: Linear time algorithms for clustering problems in any dimensions. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 1374–1385. Springer, Heidelberg (2005). https://doi.org/10.1007/11523468_111

    Chapter  Google Scholar 

  13. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM (JACM) 57(2), 5 (2010)

    Article  MathSciNet  Google Scholar 

  14. Liao, W.k., Liu, Y., Choudhary, A.: A grid-based clustering algorithm using adaptive mesh refinement. In: 7th Workshop on Mining Scientific and Engineering Datasets of SIAM International Conference on Data Mining, pp. 61–69 (2004)

    Google Scholar 

  15. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

    Article  MathSciNet  Google Scholar 

  16. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)

    Google Scholar 

  17. Pham, D.L.: Spatial models for fuzzy clustering. Comput. Vis. Image Underst. 84(2), 285–297 (2001)

    Article  MathSciNet  Google Scholar 

  18. Sheikholeslami, G., Chatterjee, S., Zhang, A.: Wavecluster: a multi-resolution clustering approach for very large spatial databases. In: VLDB, vol. 98, pp. 428–439 (1998)

    Google Scholar 

  19. Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T.: Big data clustering: a review. In: Murgante, B., et al. (eds.) ICCSA 2014. LNCS, vol. 8583, pp. 707–720. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09156-3_49

    Chapter  Google Scholar 

  20. Wang, W., Yang, J., Muntz, R., et al.: Sting: a statistical information grid approach to spatial data mining. In: VLDB, vol. 97, pp. 186–195 (1997)

    Google Scholar 

  21. Xiaoyun, C., Yufang, M., Yan, Z., Ping, W.: GMDBSCAN: multi-density DBSCAN cluster based on grid. In: IEEE International Conference on e-Business Engineering, ICEBE 2008, pp. 780–783. IEEE (2008)

    Google Scholar 

  22. Yang, M.S.: A survey of fuzzy clustering. Math. Comput. Modell. 18(11), 1–16 (1993)

    Article  MathSciNet  Google Scholar 

Download references

Acknowledgments

This research was supported by the project Fleet telematics big data analytics for vehicle usage modeling and analysis (FUMA) in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-02207), which is administered by VINNOVA, the Swedish Government Agency for Innovation Systems.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gregor Ulm .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer International Publishing AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ulm, G., Gustavsson, E., Jirstrand, M. (2018). Contraction Clustering (RASTER). In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science(), vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-72926-8_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72925-1

  • Online ISBN: 978-3-319-72926-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics