Data Mining and Knowledge Discovery, Volume 31, Issue 4, pp 972–1005

Scalable density-based clustering with quality guarantees using random projections

  • Johannes Schneider
  • Michail Vlachos

Abstract

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-the-art density-based techniques, while offering analytical guarantees on the clustering quality in Euclidean space. Moreover, it does not introduce difficult-to-set parameters. We provide a comprehensive analysis of our algorithms and a comparison with existing density-based algorithms.
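The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general building block it names, random projections, and not the authors' method. It assumes a simple scheme in which points are projected onto a random line, ordered by their projected value, and split recursively until each group is small; the names `random_unit_vector`, `project`, `split_by_projection`, and the parameter `min_size` are hypothetical and chosen for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalising a Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def project(point, direction):
    """1D coordinate of `point` along `direction` (a dot product).

    For a unit-length direction, projected distances never exceed the
    original Euclidean distances (Cauchy-Schwarz).
    """
    return sum(p * d for p, d in zip(point, direction))


def split_by_projection(indices, points, min_size, rng):
    """Recursively split `indices` along random lines (illustrative assumption).

    Each step projects the current points onto a fresh random direction,
    sorts them by projected value, and cuts at a random position, until
    every group holds at most `min_size` points.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    if len(indices) <= min_size:
        return [indices]
    direction = random_unit_vector(len(points[0]), rng)
    ordered = sorted(indices, key=lambda i: project(points[i], direction))
    cut = rng.randint(1, len(ordered) - 1)  # both sides stay non-empty
    return (split_by_projection(ordered[:cut], points, min_size, rng)
            + split_by_projection(ordered[cut:], points, min_size, rng))


# Usage: 200 synthetic 5-dimensional points, grouped into sets of at most 10.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
groups = split_by_projection(list(range(len(points))), points, min_size=10, rng=rng)
```

Points that end up in the same small group are candidates for being close in the original space, so a neighborhood or density estimate could be computed within groups rather than over all pairs. In this sketch a single split is only a heuristic; repeating it with independent random directions would be the natural way to reduce missed neighbors.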

Keywords

  • Density-based clustering
  • Random projections
  • Nearest neighbors

Notes

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant Agreement No. 259569.

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. University of Liechtenstein, Vaduz, Liechtenstein
  2. IBM Research - Zürich, Rüschlikon, Switzerland
