Frequent Pattern Mining Algorithms for Data Clustering

Abstract

Subspace clustering, the discovery of clusters in subspaces of the data, and related clustering paradigms form a research field shaped by many influences from frequent pattern mining. Indeed, because the first subspace clustering algorithms were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining stood at the cradle of subspace clustering, which nevertheless quickly developed into an independent research field.

In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering and projected clustering, and we point out recent research questions and open topics in this area that are relevant to researchers in either clustering or pattern mining.

Keywords

Subspace clustering, Monotonicity, Redundancy

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Ludwig-Maximilians-Universität München, Munich, Germany
  2. Department of Computer Science, Aarhus University, Aarhus, Denmark
  3. Max-Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
