East European Conference on Advances in Databases and Information Systems

ADBIS 2015: New Trends in Databases and Information Systems, pp 175-185

CLUS: Parallel Subspace Clustering Algorithm on Spark

  • Bo Zhu
  • Alexandru Mara
  • Alberto Mozo
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 539)

Abstract

Subspace clustering techniques were proposed to discover hidden clusters that exist only in certain subsets of the full feature space. However, the time complexity of such algorithms can grow exponentially with the dimensionality of the dataset. In addition, datasets in current big data scenarios are generally too large to fit on a single machine. This extremely high computational complexity, which results in poor scalability with respect to both the size and the dimensionality of these datasets, gives us strong motivation to propose a parallelized subspace clustering algorithm able to handle large high-dimensional data. To the best of our knowledge, there are no other parallel subspace clustering algorithms that run on top of new-generation big data distributed platforms such as MapReduce and Spark. In this paper we introduce CLUS, a novel parallel solution for subspace clustering based on the SUBCLU algorithm. CLUS uses a new dynamic data partitioning method specifically designed to continuously optimize the varying size and content of the data required in each iteration, in order to take full advantage of Spark's in-memory primitives. This method minimizes communication cost between nodes, maximizes their CPU usage, and balances the load among them. Consequently, the execution time is significantly reduced. Finally, we conduct several experiments with a series of real and synthetic datasets to demonstrate the scalability, the accuracy, and the nearly linear speedup of the implementation with respect to the number of nodes.
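
To make the complexity claim concrete, the following is a minimal single-machine sketch of the bottom-up, Apriori-style subspace search that SUBCLU-like algorithms are generally described as using. It is not the CLUS implementation and involves no Spark code. The names `generate_candidates`, `bottom_up_search`, and the `has_cluster` predicate are hypothetical; `has_cluster` stands in for running a density-based clustering on the data projected onto a subspace.

```python
from itertools import combinations


def generate_candidates(subspaces_k):
    """Join k-dimensional subspaces into (k+1)-dimensional candidates.

    Each subspace is a sorted tuple of attribute indices. A candidate is
    kept only if every one of its k-dimensional projections is in the input
    (monotonicity pruning).
    """
    known = set(subspaces_k)
    candidates = set()
    for a, b in combinations(sorted(known), 2):
        # Apriori-style join: both tuples must share their first k-1 attributes.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            if all(sub in known for sub in combinations(cand, len(a))):
                candidates.add(cand)
    return sorted(candidates)


def bottom_up_search(num_dims, has_cluster):
    """Return all subspaces for which has_cluster(subspace) is True.

    has_cluster is a hypothetical predicate (an assumption of this sketch)
    that would cluster the data projected onto the given attributes.
    """
    level = [(d,) for d in range(num_dims) if has_cluster((d,))]
    found = []
    while level:
        found.extend(level)
        level = [s for s in generate_candidates(level) if has_cluster(s)]
    return found


if __name__ == "__main__":
    # Toy predicate: pretend clusters exist only among attributes 0, 1 and 2.
    toy = lambda s: set(s) <= {0, 1, 2}
    for subspace in bottom_up_search(4, toy):
        print(subspace)
```

In the worst case every non-empty subset of the d attributes survives pruning, so up to 2^d - 1 subspaces have to be examined, which is where the exponential dependence on dimensionality comes from.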

Keywords

Subspace, Parallel, Clustering, Spark, Big data

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Universidad Politécnica de Madrid, Madrid, Spain
