On Mining Maximal Pattern-Based Clusters

Data Mining for Business Applications

Pattern-based clustering is important in many applications, such as DNA micro-array data analysis in bio-informatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databases remains challenging. On the one hand, there can be a huge number of clusters, many of them redundant, which makes pattern-based clustering ineffective. On the other hand, previously proposed methods may not be efficient or scalable in mining large databases.

In this paper, we study the problem of maximal pattern-based clustering. The key idea is to avoid redundant clusters completely by mining only the maximal pattern-based clusters. We show that maximal pattern-based clusters are the skylines of all pattern-based clusters. We develop two efficient algorithms, MaPle and MaPle+ (MaPle stands for Maximal Pattern-based Clustering). Both algorithms conduct a depth-first, progressively refining search and prune unpromising branches. MaPle+ further integrates several heuristics. Our extensive performance study on both synthetic and real data sets shows that maximal pattern-based clustering is effective: it reduces the number of clusters substantially. Moreover, MaPle and MaPle+ are more efficient and scalable than the previously proposed pattern-based clustering methods in mining large databases, and MaPle+ often performs better than MaPle.
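
The notion of "maximal" can be illustrated with a small sketch. The abstract does not spell out the cluster definition, so the code below assumes the δ-pCluster model of Wang, Wang, Yang, and Yu (SIGMOD'02): a set of objects and attributes forms a cluster when every 2×2 submatrix has a pScore of at most δ. A cluster is then treated as maximal when no other cluster contains both its objects and its attributes. The function names `is_delta_pcluster` and `maximal_clusters` are hypothetical, and the filter is a brute-force illustration of the skyline idea, not the MaPle or MaPle+ search.

```python
from itertools import combinations


def is_delta_pcluster(data, objects, attributes, delta):
    """Return True if every 2x2 submatrix has pScore <= delta.

    Assumed definition (delta-pCluster model):
    pScore = |(d[x][a] - d[x][b]) - (d[y][a] - d[y][b])|.
    """
    for x, y in combinations(sorted(objects), 2):
        for a, b in combinations(sorted(attributes), 2):
            p_score = abs((data[x][a] - data[x][b]) - (data[y][a] - data[y][b]))
            if p_score > delta:
                return False
    return True


def maximal_clusters(clusters):
    """Keep only clusters not contained in another cluster.

    Each cluster is a pair (frozenset of objects, frozenset of attributes).
    Containment means both sets are subsets of the other cluster's sets,
    so the survivors are the skyline under that partial order.
    """
    unique = set(clusters)  # drop exact duplicates first
    return [
        (objs, attrs)
        for objs, attrs in unique
        if not any(
            (objs, attrs) != (o2, a2) and objs <= o2 and attrs <= a2
            for o2, a2 in unique
        )
    ]
```

For example, with rows `[1, 2, 3]` and `[11, 12, 13]` the two objects shift by a constant across all three attributes, so they form a cluster for δ = 0, and any sub-cluster on fewer attributes is redundant and would be filtered out. A post-hoc filter like this still has to enumerate every cluster first; the point of the approach described above is to avoid generating the non-maximal ones at all.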



Author information

Corresponding author

Correspondence to Philip S. Yu.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Pei, J., Zhang, X., Cho, M., Wang, H., Yu, P.S. (2009). On Mining Maximal Pattern-Based Clusters. In: Cao, L., Yu, P.S., Zhang, C., Zhang, H. (eds) Data Mining for Business Applications. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-79420-4_3

  • DOI: https://doi.org/10.1007/978-0-387-79420-4_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-79419-8

  • Online ISBN: 978-0-387-79420-4

  • eBook Packages: Computer Science, Computer Science (R0)
