Pattern-based clustering is important in many applications, such as DNA microarray data analysis in bioinformatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databases remains challenging. On the one hand, there can be a huge number of clusters, many of them redundant, which makes pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable when mining large databases.
In this paper, we study the problem of maximal pattern-based clustering. The major idea is that redundant clusters are avoided completely by mining only the maximal pattern-based clusters. We show that maximal pattern-based clusters are skylines of all pattern-based clusters. Two efficient algorithms, MaPle and MaPle+ (MaPle stands for Maximal Pattern-based Clustering), are developed. The algorithms conduct a depth-first, progressively refining search and prune unpromising branches. MaPle+ further integrates several heuristics. Our extensive performance study on both synthetic and real data sets shows that maximal pattern-based clustering is effective: it reduces the number of clusters substantially. Moreover, MaPle and MaPle+ are more efficient and scalable than previously proposed pattern-based clustering methods in mining large databases, and MaPle+ often performs better than MaPle.
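The notion of maximality can be illustrated with a minimal sketch. It assumes the commonly used δ-pCluster model, in which a set of rows and columns forms a cluster when every 2×2 submatrix has a pScore of at most δ; the abstract itself does not state this definition, so treat it as an assumption. The function names `is_delta_pcluster` and `maximal_clusters` and the toy matrix are hypothetical, and the brute-force filtering below is not the MaPle search, only a picture of what "maximal" means.

```python
from itertools import combinations


def is_delta_pcluster(data, rows, cols, delta):
    """Check the assumed pCluster condition on the submatrix rows x cols.

    For every pair of rows (x, y) and pair of columns (a, b), the pScore
    |(d_xa - d_xb) - (d_ya - d_yb)| must not exceed delta.
    """
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            p_score = abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))
            if p_score > delta:
                return False
    return True


def maximal_clusters(clusters):
    """Keep only clusters that are not contained in a different cluster.

    Each cluster is a (frozenset of rows, frozenset of columns) pair.
    """
    result = []
    for rows, cols in clusters:
        dominated = any(
            (rows, cols) != (other_rows, other_cols)
            and rows <= other_rows
            and cols <= other_cols
            for other_rows, other_cols in clusters
        )
        if not dominated:
            result.append((rows, cols))
    return result


# Toy matrix (made up for illustration): rows 0 and 1 differ by a constant shift.
data = [
    [1, 4, 2],
    [3, 6, 4],
    [9, 2, 7],
]

small = (frozenset({0, 1}), frozenset({0, 1}))
large = (frozenset({0, 1}), frozenset({0, 1, 2}))
candidates = [c for c in (small, large) if is_delta_pcluster(data, *c, delta=0)]
print(maximal_clusters(candidates))
```

Here both candidates satisfy the cluster condition, but the smaller one is contained in the larger and is therefore redundant; only the larger survives the filter. Reporting only such non-dominated clusters is the sense in which maximal clusters act as a skyline.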
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Pei, J., Zhang, X., Cho, M., Wang, H., Yu, P.S. (2009). On Mining Maximal Pattern-Based Clusters. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_3
DOI: https://doi.org/10.1007/978-0-387-79420-4_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-79419-8
Online ISBN: 978-0-387-79420-4
eBook Packages: Computer Science, Computer Science (R0)