Knowledge and Information Systems, Volume 52, Issue 1, pp 83–111

Synchronization-based scalable subspace clustering of high-dimensional data

  • Junming Shao
  • Xinzuo Wang
  • Qinli Yang
  • Claudia Plant
  • Christian Böhm
Regular Paper

Abstract

How can the challenges of the “curse of dimensionality” and “scalability” in clustering be addressed simultaneously? In this paper, we propose arbitrarily oriented synchronized clusters (ORSC), a novel, effective, and efficient method for subspace clustering inspired by synchronization. Synchronization is a basic phenomenon prevalent in nature, capable of controlling even highly complex processes such as opinion formation in a group. This control is achieved through simple operations based on interactions between objects. Relying on a weighted interaction model and iterative dynamic clustering, ORSC (a) naturally detects correlation clusters in arbitrarily oriented subspaces, including arbitrarily shaped nonlinear correlation clusters; (b) is robust against noise and outliers; and (c) in contrast to previous methods, is easy to parameterize, since there is no need to specify the subspace dimensionality or other difficult parameters. Instead, all interesting subspaces are detected fully automatically. Finally, (d) ORSC outperforms most comparison methods in terms of runtime efficiency and is highly scalable to large and high-dimensional data sets. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
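The abstract does not spell out the weighted interaction model, so the following is only a minimal sketch of the general idea of clustering by synchronization, not ORSC itself. It assumes a plain Kuramoto-style coupling in which every point moves toward the points within a radius `eps`, and points that end up at (nearly) the same location are treated as one cluster. The function names `synchronize` and `label_clusters` and the parameters `eps`, `max_iter`, `tol`, and `merge_tol` are hypothetical choices made for this illustration.

```python
import numpy as np


def synchronize(points, eps=0.3, max_iter=50, tol=1e-4):
    """Move each point toward its eps-neighbors until positions stabilize.

    Assumption: a generic Kuramoto-style update, not the weighted
    interaction model of ORSC.
    """
    x = np.asarray(points, dtype=float).copy()
    for _ in range(max_iter):
        # diff[i, j] = x[j] - x[i] for every pair of points
        diff = x[None, :, :] - x[:, None, :]
        dist = np.linalg.norm(diff, axis=2)
        near = dist <= eps  # every point is its own neighbor, so counts are >= 1
        # Average sin-coupling over the neighbors of each point
        coupling = (np.sin(diff) * near[:, :, None]).sum(axis=1)
        coupling /= near.sum(axis=1, keepdims=True)
        x = x + coupling
        if np.abs(coupling).max() < tol:
            break
    return x


def label_clusters(synced, merge_tol=1e-2):
    """Give the same label to points that synchronized to the same location."""
    labels = -np.ones(len(synced), dtype=int)
    next_label = 0
    for i in range(len(synced)):
        if labels[i] >= 0:
            continue
        close = np.linalg.norm(synced - synced[i], axis=1) <= merge_tol
        labels[close & (labels < 0)] = next_label
        next_label += 1
    return labels


# Two small, well-separated groups of points (toy data for illustration)
data = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
])
labels = label_clusters(synchronize(data))
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy setting the interaction radius is the only parameter that shapes the result. Detecting arbitrarily oriented subspaces, as the abstract describes for ORSC, would need more than this isotropic neighborhood.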

Keywords

  • Subspace clustering
  • Synchronization
  • High-dimensional data
  • Large data set

Notes

Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 61403062, 61433014, 41601025), the China Postdoctoral Science Foundation (Grant Nos. 2014M552344, 2015M580786), the Science-Technology Foundation for Young Scientist of SiChuan Province (Grant No. 2016JQ0007), and the Fundamental Research Funds for the Central Universities (Grant Nos. ZYGX2014J053, ZYGX2014J091).

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Junming Shao (1)
  • Xinzuo Wang (1)
  • Qinli Yang (1)
  • Claudia Plant (2)
  • Christian Böhm (3)

  1. School of Computer Science and Engineering, Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China
  2. Institute for Computer Science, University of Vienna, Vienna, Austria
  3. Institute for Computer Science, University of Munich, Munich, Germany
