Frequent Pattern Mining Algorithms for Data Clustering

Abstract

Subspace clustering, the discovery of clusters in subspaces, and related clustering paradigms form a research field with many influences from frequent pattern mining. Indeed, because the first subspace clustering algorithms were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining stood at the cradle of subspace clustering, although subspace clustering quickly developed into an independent research field.

In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering and projected clustering, and we point out recent research questions and open topics in this area that are relevant to researchers in either clustering or pattern mining.
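
To make the link between the two fields concrete, the following is a minimal sketch of an Apriori-style, bottom-up search for dense grid cells in subspaces. It is not taken from the chapter. The function names (`to_items`, `dense_units`), the parameters (`bins`, `min_support`), the equal-width grid, and the assumption that all coordinates lie in [0, 1) are illustrative choices. Each point is treated as a transaction whose items are (dimension, interval) pairs, so a frequent itemset with one interval per dimension corresponds to a dense cell in the subspace spanned by those dimensions.

```python
from itertools import combinations


def to_items(point, bins):
    """Turn a point into a 'transaction' of (dimension, interval index) items.

    Assumption: every coordinate lies in [0, 1) and each axis is cut into
    `bins` equal-width intervals.
    """
    return frozenset((d, min(int(x * bins), bins - 1)) for d, x in enumerate(point))


def dense_units(points, bins=4, min_support=3):
    """Level-wise search for dense grid cells in all subspaces.

    Returns a dict mapping each dense cell (a frozenset of items) to the
    number of points it contains.
    """
    transactions = [to_items(p, bins) for p in points]

    # Level 1: count one-dimensional cells.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {u for u, c in counts.items() if c >= min_support}
    result = {u: counts[u] for u in current}

    k = 1
    while current:
        k += 1
        candidates = set()
        for a, b in combinations(current, 2):
            u = a | b
            if len(u) != k:
                continue  # the two cells do not share k-2 items
            if len({d for d, _ in u}) != k:
                continue  # two intervals of the same dimension never co-occur
            # Monotonicity: a cell can only be dense if all of its
            # (k-1)-dimensional projections are dense.
            if all(frozenset(s) in current for s in combinations(u, k - 1)):
                candidates.add(u)
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {u for u, c in counts.items() if c >= min_support}
        result.update({u: counts[u] for u in current})
    return result


if __name__ == "__main__":
    # Three points agree in dimensions 0 and 1 but are spread out in dimension 2.
    points = [
        (0.10, 0.10, 0.1),
        (0.15, 0.20, 0.4),
        (0.20, 0.05, 0.9),
        (0.80, 0.60, 0.3),
        (0.60, 0.90, 0.6),
    ]
    for unit, support in dense_units(points).items():
        print(sorted(unit), support)
```

In this toy data set the only dense two-dimensional cell lies in the subspace of dimensions 0 and 1, and no cell is dense in the full three-dimensional space. The sketch only finds dense cells; merging adjacent dense cells into clusters, as grid-based subspace clustering methods do, is left out.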

Keywords

  • Subspace clustering
  • Monotonicity
  • Redundancy

Acknowledgments

Ira Assent is partly supported by the Danish Council for Independent Research—Technology and Production Sciences (FTP), grant 10-081972. Jilles Vreeken is supported by the Cluster of Excellence ‘Multimodal Computing and Interaction’ within the Excellence Initiative of the German Federal Government.

Author information

Corresponding author

Correspondence to Arthur Zimek.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Zimek, A., Assent, I., Vreeken, J. (2014). Frequent Pattern Mining Algorithms for Data Clustering. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_16

  • DOI: https://doi.org/10.1007/978-3-319-07821-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07820-5

  • Online ISBN: 978-3-319-07821-2

  • eBook Packages: Computer Science, Computer Science (R0)