Contraction Clustering (RASTER)

A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

Part of the Lecture Notes in Computer Science book series (LNISA, volume 10710)

Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Contraction Clustering (RASTER) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible, as it depends on neither the input size nor the number of clusters. Its memory requirements are constant. Consequently, RASTER is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step, which projects objects onto tiles, and (2) an agglomeration step, which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 s, even when implemented in a slow interpreted programming language like Python. Furthermore, RASTER is easily parallelizable.
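
The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' reference implementation. It assumes 2D points, and the names `contract`, `agglomerate`, `precision`, `threshold`, and `min_size` are chosen here for illustration only. Tiles are obtained by scaling and flooring the coordinates, and tiles are grouped when they touch, including diagonally.

```python
import math
from collections import Counter


def contract(points, precision=1, threshold=2):
    """Contraction (sketch): project points onto tiles and keep dense tiles.

    `precision` and `threshold` are illustrative parameter names.
    """
    scale = 10 ** precision
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    # Only tiles holding at least `threshold` points are kept.
    return {tile for tile, n in counts.items() if n >= threshold}


def agglomerate(tiles, min_size=2):
    """Agglomeration (sketch): group touching tiles into clusters."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        cluster = []
        while stack:
            x, y = stack.pop()
            cluster.append((x, y))
            # Visit the eight surrounding tiles; each tile is removed from
            # `remaining` once, so the total work is linear in the tile count.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (x + dx, y + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        stack.append(neighbor)
        if len(cluster) >= min_size:
            clusters.append(sorted(cluster))
    return clusters


points = [(0.11, 0.12), (0.13, 0.14), (0.21, 0.15), (0.25, 0.18), (5.0, 5.0)]
tiles = contract(points, precision=1, threshold=2)
clusters = agglomerate(tiles, min_size=2)
```

In this toy input, the isolated point at (5.0, 5.0) lands alone on its tile and is discarded during contraction, while the remaining four points fall onto two adjacent tiles that form a single cluster.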

Keywords

  • Algorithms
  • Big data
  • Machine learning
  • Unsupervised learning
  • Clustering

Chapter length: 13 pages

Notes

  1.

    The chosen shorthand may not be immediately obvious: RASTER operates on an implied grid. Resulting clusters can be made to look similar to the dot matrix structure of a raster graphics image. Furthermore, the name RASTER is an agglomerated contraction of the words and

  2.

    On a contemporary workstation with 16 GB RAM, the scikit-learn implementation of DBSCAN cannot even handle one million data points.

  3.

    We have identified tens of thousands of clusters with RASTER in a huge real-world data set, which shows that k-means clustering would have been highly unsuitable.

  4.

    Reference implementations in several programming languages are available at https://gitlab.com/fraunhofer_chalmers_centre/contraction_clustering_raster.

  5.

    In case it is not obvious why this is in linear time: For each row in a grid, take the current and next row into account. Start, for instance, with the tile in the top left corner and take its right neighbor as well as the two tiles adjacent in the next row into account.
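
One possible reading of note 5 is sketched below in Python. Each tile looks only at its right neighbor and at the adjacent tiles in the next row, so the number of neighbor checks per tile is constant. The function name `forward_neighbor_clusters` and the use of a union-find structure to merge tiles are assumptions made for this illustration and are not taken from the paper. With union-find the merging is near-linear rather than strictly linear.

```python
def forward_neighbor_clusters(tiles):
    """Group touching tiles by checking only forward neighbors (sketch)."""
    tiles = set(tiles)
    parent = {t: t for t in tiles}

    def find(t):
        # Follow parent links to the representative, halving the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Right neighbor plus the adjacent tiles in the next row. A corner tile
    # has only two of the three next-row neighbors inside the grid.
    forward = ((1, 0), (-1, 1), (0, 1), (1, 1))
    for x, y in tiles:
        for dx, dy in forward:
            neighbor = (x + dx, y + dy)
            if neighbor in parent:
                parent[find(neighbor)] = find((x, y))

    groups = {}
    for t in tiles:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


example = forward_neighbor_clusters({(0, 0), (1, 0), (1, 1), (5, 5), (6, 6)})
```

Every pair of touching tiles is examined exactly once, from whichever tile comes first in row order, which is why backward neighbors never need to be checked.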

Acknowledgments

This research was supported by the project Fleet telematics big data analytics for vehicle usage modeling and analysis (FUMA) in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-02207), which is administered by VINNOVA, the Swedish Government Agency for Innovation Systems.

Author information

Corresponding author

Correspondence to Gregor Ulm.

Copyright information

© 2018 Springer International Publishing AG

Cite this paper

Ulm, G., Gustavsson, E., Jirstrand, M. (2018). Contraction Clustering (RASTER). In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_6

  • DOI: https://doi.org/10.1007/978-3-319-72926-8_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72925-1

  • Online ISBN: 978-3-319-72926-8

  • eBook Packages: Computer Science, Computer Science (R0)