On Mining Maximal Pattern-Based Clusters

Pei, Jian; Zhang, Xiaoling; Cho, Moonjung; Wang, Haixun; Yu, Philip S.

doi:10.1007/978-0-387-79420-4_3

Pattern-based clustering is important in many applications, such as DNA micro-array data analysis in bio-informatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databases is still challenging. On the one hand, there can be a huge number of clusters, many of which are redundant and thus make pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable in mining large databases.

In this paper, we study the problem of maximal pattern-based clustering. The major idea is that redundant clusters are avoided completely by mining only the maximal pattern-based clusters. We show that maximal pattern-based clusters are skylines of all pattern-based clusters. Two efficient algorithms, MaPle and MaPle+ (MaPle stands for Maximal Pattern-based Clustering), are developed. The algorithms conduct a depth-first, progressively refining search and prune unpromising branches. MaPle+ further integrates several interesting heuristics. Our extensive performance study on both synthetic data sets and real data sets shows that maximal pattern-based clustering is effective: it reduces the number of clusters substantially. Moreover, MaPle and MaPle+ are more efficient and scalable than the previously proposed pattern-based clustering methods in mining large databases, and MaPle+ often performs better than MaPle.
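
To make the redundancy argument concrete, the following is a minimal sketch, not the MaPle or MaPle+ algorithm. It assumes the pScore-based definition of a pattern-based cluster that is common in the pattern-based clustering literature (the abstract itself does not state a definition): a set of objects and a set of attributes form a cluster if every 2×2 submatrix has pScore at most a threshold `delta`. All function names, the toy data, and the brute-force enumeration are illustrative assumptions; the enumeration is exponential and only suitable for tiny inputs.

```python
from itertools import combinations


def p_score(data, x, y, a, b):
    """pScore of the 2x2 submatrix on objects x, y and attributes a, b
    (assumed definition: |(x_a - x_b) - (y_a - y_b)|)."""
    return abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))


def is_pcluster(data, objects, attrs, delta):
    """True if every pair of objects agrees, within delta, on every pair of attributes."""
    return all(
        p_score(data, x, y, a, b) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attrs, 2)
    )


def brute_force_pclusters(data, delta, min_objects=2, min_attrs=2):
    """Enumerate all pattern-based clusters by exhaustive search (illustration only)."""
    n_obj, n_attr = len(data), len(data[0])
    found = []
    for k in range(min_objects, n_obj + 1):
        for objs in combinations(range(n_obj), k):
            for m in range(min_attrs, n_attr + 1):
                for attrs in combinations(range(n_attr), m):
                    if is_pcluster(data, objs, attrs, delta):
                        found.append((frozenset(objs), frozenset(attrs)))
    return found


def maximal_only(clusters):
    """Keep only clusters that are not contained in another cluster
    on both the object set and the attribute set."""
    def contained(c, other):
        return c != other and c[0] <= other[0] and c[1] <= other[1]

    return [c for c in clusters if not any(contained(c, o) for o in clusters)]


# Toy data: rows are objects, columns are attributes.
# The first three columns are shifted copies of one pattern; the last is noise.
data = [
    [1, 2, 3, 10],
    [2, 3, 4, 0],
    [5, 6, 7, 5],
]

all_clusters = brute_force_pclusters(data, delta=0)
maximal = maximal_only(all_clusters)
print(len(all_clusters), len(maximal))  # 16 1
```

On this toy input, every sub-cluster of the single 3×3 cluster is itself a valid cluster, so 16 clusters collapse to one maximal cluster. The containment test in `maximal_only` is one way to read the skyline analogy: a maximal cluster is not dominated by any other cluster in both its object set and its attribute set.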

Author information

Authors and Affiliations

Boston University, Boston, USA
Xiaoling Zhang

Corresponding author

Correspondence to Philip S. Yu.

Editor information

Editors and Affiliations

School of Software Faculty of Engineering and Information Technology, University of Technology, PO Box 123, Sydney, Broadway, NSW 2007, Australia
Longbing Cao & Huaifeng Zhang
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan St., Chicago, IL, 60607
Philip S. Yu
Centre for Quantum Computation and Intelligent Systems Faculty of Engineering and Information Technology, University of Technology, PO Box 123, Sydney, Broadway, NSW 2007, Australia
Chengqi Zhang

Cite this chapter

Pei, J., Zhang, X., Cho, M., Wang, H., Yu, P.S. (2009). On Mining Maximal Pattern-Based Clusters. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_3

DOI: https://doi.org/10.1007/978-0-387-79420-4_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-79419-8
Online ISBN: 978-0-387-79420-4
eBook Packages: Computer Science, Computer Science (R0)