Frequent Pattern Mining Algorithms for Data Clustering

Zimek, Arthur; Assent, Ira; Vreeken, Jilles

doi:10.1007/978-3-319-07821-2_16

Chapter
First Online: 01 January 2014

Abstract

Discovering clusters in subspaces (subspace clustering and related clustering paradigms) is a research field with many influences from frequent pattern mining. In fact, because the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining stood at the cradle of subspace clustering; yet subspace clustering quickly developed into an independent research field.

In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering and projected clustering, and we point out recent research questions and open topics in this area that are relevant to researchers in either clustering or pattern mining.
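
To make the link between the two fields concrete, the following is a minimal sketch of a grid-based, bottom-up subspace search in the Apriori style. It is an illustration under stated assumptions, not an algorithm taken from the chapter: the function name `dense_units`, the parameters `xi` (intervals per dimension) and `min_points` (density threshold), and the restriction to values in [0, 1) are all hypothetical choices. Each point is treated as a transaction of (dimension, interval) items, so a dense grid unit in a k-dimensional subspace plays the role of a frequent k-itemset.

```python
from itertools import combinations


def dense_units(points, xi=4, min_points=3):
    """Apriori-style search for dense grid units in axis-parallel subspaces.

    Assumes every coordinate lies in [0, 1). Returns a list whose entry k-1
    holds the dense units of k-dimensional subspaces, each unit being a
    frozenset of (dimension, interval) pairs.
    """
    # Turn each point into a "transaction" of (dimension, interval) items.
    transactions = [
        frozenset((d, min(int(v * xi), xi - 1)) for d, v in enumerate(p))
        for p in points
    ]

    def support(unit):
        # Number of points that fall into the grid unit.
        return sum(1 for t in transactions if unit <= t)

    # Level 1: dense one-dimensional units.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_points}

    result = []
    k = 1
    while level:
        result.append(level)
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            unit = a | b
            # Keep only joins that add exactly one new dimension.
            if len(unit) != k or len({d for d, _ in unit}) != k:
                continue
            # Monotonicity pruning: every (k-1)-dim projection must be dense.
            if all(frozenset(s) in level for s in combinations(unit, k - 1)):
                candidates.add(unit)
        level = {c for c in candidates if support(c) >= min_points}
    return result
```

The pruning step relies on the same monotonicity argument as frequent itemset mining: a unit can only be dense in a subspace if all of its lower-dimensional projections are dense as well. A full subspace clustering method would additionally merge adjacent dense units into clusters, which this sketch omits.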

Acknowledgments

Ira Assent is partly supported by the Danish Council for Independent Research—Technology and Production Sciences (FTP), grant 10-081972. Jilles Vreeken is supported by the Cluster of Excellence ‘Multimodal Computing and Interaction’ within the Excellence Initiative of the German Federal Government.

Author information

Authors and Affiliations

Ludwig-Maximilians-Universität München, Munich, Germany
Arthur Zimek
Department of Computer Science, Aarhus University, Aarhus, Denmark
Ira Assent
Max-Planck Institute for Informatics and Saarland University, Saarbrücken, Germany
Jilles Vreeken

Corresponding author

Correspondence to Arthur Zimek.

Editor information

Editors and Affiliations

IBM, Yorktown Heights, New York, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Jiawei Han

About this chapter

Cite this chapter

Zimek, A., Assent, I., Vreeken, J. (2014). Frequent Pattern Mining Algorithms for Data Clustering. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_16

DOI: https://doi.org/10.1007/978-3-319-07821-2_16
Published: 30 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07820-5
Online ISBN: 978-3-319-07821-2
eBook Packages: Computer Science, Computer Science (R0)