Data Mining and Knowledge Discovery, Volume 18, Issue 1, pp 1–29

CONTOUR: an efficient algorithm for discovering discriminating subsequences

  • Jianyong Wang
  • Yuzhou Zhang
  • Lizhu Zhou
  • George Karypis
  • Charu C. Aggarwal
Article

Abstract

In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.
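
To make the baseline described above concrete, here is a minimal sketch of the two-step approach the abstract argues against: enumerate all frequent subsequences first, then filter them afterwards. It is not the CONTOUR algorithm. The function names (`is_subsequence`, `support`, `frequent_subsequences`), the `max_len` cap, and the toy database are assumptions made purely for illustration.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """Return True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so matches must occur left to right.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Count how many input sequences contain pattern as a subsequence."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def frequent_subsequences(database, min_sup, max_len=3):
    """Naive baseline (hypothetical helper): enumerate every subsequence of
    length <= max_len and keep those whose support is at least min_sup."""
    candidates = set()
    for seq in database:
        for k in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), k):
                candidates.add(tuple(seq[i] for i in idx))
    result = {}
    for pattern in candidates:
        sup = support(pattern, database)
        if sup >= min_sup:
            result[pattern] = sup
    return result


if __name__ == "__main__":
    # Toy database of four sequences over single-character items.
    db = ["abcd", "acbd", "abd", "xyz"]
    for pattern, sup in sorted(frequent_subsequences(db, min_sup=3).items()):
        print("".join(pattern), sup)
```

The number of candidates grows combinatorially with sequence length, which is why exhaustive enumeration becomes very time consuming on large databases and why mining a useful subset directly is attractive.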

Keywords

Sequence mining · Discriminating subsequence · Summarization subsequence · Clustering

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Jianyong Wang (1)
  • Yuzhou Zhang (1)
  • Lizhu Zhou (1)
  • George Karypis (2)
  • Charu C. Aggarwal (3)

  1. Tsinghua University, Beijing, China
  2. University of Minnesota, Minneapolis, USA
  3. IBM T.J. Watson Research Center, Hawthorne, USA
