# Scalable density-based clustering with quality guarantees using random projections

## Abstract

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques that can discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-the-art density-based techniques, while offering analytical guarantees on the clustering quality in Euclidean space. Moreover, it does not introduce difficult-to-set parameters. We provide a comprehensive analysis of our algorithms and a comparison with existing density-based algorithms.
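As a rough illustration of why random projections can help with neighborhood computations in Euclidean space, the sketch below projects points onto a single random line and checks that projected gaps never exceed the original distances. This is not the algorithm presented in the paper; the helper names (`random_unit_vector`, `project`, `euclidean`), the data, and the dimensions are all assumptions chosen for the example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector (hypothetical helper)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(point, direction):
    """Scalar projection of a point onto a unit direction."""
    return sum(p * d for p, d in zip(point, direction))


def euclidean(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Synthetic data: 200 Gaussian points in 50 dimensions (assumed sizes).
rng = random.Random(0)
dim = 50
points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]

direction = random_unit_vector(dim, rng)
projected = [project(p, direction) for p in points]

# By the Cauchy-Schwarz inequality, the gap between two projected values
# is at most the Euclidean distance between the original points, so points
# that are close in the original space stay close on the random line.
gaps_ok = all(
    abs(projected[i] - projected[j]) <= euclidean(points[i], points[j]) + 1e-9
    for i in range(len(points))
    for j in range(i + 1, len(points))
)
print(gaps_ok)
```

The converse does not hold: distant points can land close together on one line, which is why methods built on this idea typically combine several independent projections.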

## Keywords

Density-based clustering, Random projections, Nearest neighbors

## Notes

### Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant Agreement No. 259569.
