Skip to main content
Log in

Mining minimal distinguishing subsequence patterns with gap constraints

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Agrawel R, Srikant R (1995) Mining sequential patterns. In: Yu PS, Chen ALP (eds) Proceedings of the 11th international conference on data engineering, IEEE computer society, Taipei, Taiwan, pp 3–14.

    Google Scholar 

  2. Antunes C, Oliveira AL (2003) Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In: Perner P, Rosenfeld A (eds) Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition, lecture notes in computer science, Leipzig, Germany, pp 239–251.

  3. Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SlGKDD international conference on knowledge discovery and data mining, ACM, Edmonton, Alberta, Canada, pp 429–435.

    Google Scholar 

  4. Bailey J, Manoukian T, Ramamohanarao K (2003) Classification using constrained emerging patterns. In: Dong G, Tang C, Wang W (eds) Proceedings of the 4th international conference on advances in web-age information management, lecture notes in computer science, Chengdu, China, pp 226–237.

  5. Bay SD, Pazzani MJ (2001) Detecting group differences: Mining contrast sets. Data Mining Know Disc 5:213–246.

    Article  MATH  Google Scholar 

  6. Garriga GC (2003) Discovering unbounded episodes in sequential data. In: Lavrac N, Gamberger D, Blockeel H, Todorovski L (eds) Proceedings of the 7th European conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Cavtat-Dubrovnik, Croatia, pp 83–94.

    Google Scholar 

  7. Chan S, Kao B, Yip CL et al (2003) Mining emerging substrings. In: Proceedings of the 8th international conference on database systems for advanced applications, IEEE Computer Society, Kyoto, Japan, pp 119–126.

    Google Scholar 

  8. Das G, Fleischer R, Gasieniec L et al (1997) Episode matching. In: Apostolico A, Hein J (eds) Proceedings of the 8th annual symposium on combinatorial pattern matching, lecture notes in computer science, Aarhus, Denmark, pp 12–27.

    Google Scholar 

  9. Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, San Diego, CA, USA, pp 43–52.

    Chapter  Google Scholar 

  10. Dong G, Li J (2005) Mining border descriptions of emerging patterns from dataset pairs. Know and Inf Syst 8:178–202.

    Article  Google Scholar 

  11. Dong G, Zhang X, Wong L et al (1999) CAEP: Classification by aggregating emerging patterns. In: Arikawa S, Furukawa K (eds) Proceedings of the 2nd international conference on discovery science, lecture notes in computer science, Tokyo, Japan, pp 30–42.

  12. Fischer J, Raedt LD (2004) Towards optimizing conjunctive inductive queries. In: Dai H, Srikant R, Zhang C (eds) Proceedings of the 8th Pacific-Asia conference on advances in knowledge discovery and data mining, lecture notes in computer science, Sydney, Australia, pp 625–637.

  13. Gusfield D (1997) Algorithms on strings, trees and sequences, computer science and computational biology. Cambridge University Press, Cambridge, pp 235–244.

    MATH  Google Scholar 

  14. Han J, Pei J, Mortazavi-Asl B et al (2000) Freespan: frequent pattern-projected sequential pattern mining. In: proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Boston, MA, USA, pp 355–359.

    Chapter  Google Scholar 

  15. Hirao M, Hoshino H, Shinohara A et al (2003) A practical algorithm to find the best subsequence patterns. Theor Comp Sci 292:465–479.

    Article  MATH  MathSciNet  Google Scholar 

  16. Ji X, Bailey J, Dong D (2005) Mining minimal distinguishing subsequence patterns with gap constraints. In: Proceedings of the 5th IEEE international conference on data mining, IEEE Computer Society, Houston, Texas USA, pp 194–201.

    Google Scholar 

  17. Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Int Syst 15:48–56.

    Article  Google Scholar 

  18. Li J, Dong G, Ramamohanarao K (2001) Making use of the most expressive jumping emerging patterns for classification. Know Info Sys 3:131–145.

    Article  Google Scholar 

  19. Maier D (1978) The complexity of some problems on subsequences and supersequences. J ACM 25(2):322–336.

    Article  MATH  MathSciNet  Google Scholar 

  20. Mannila H, Toivonen H, Verkamo AI (1995) Discovering frequent episodes in sequences. In: Fayyad UM, Uthurusamy R (eds) Proceedings of the 1st international conference on knowledge discovery and data mining, AAAI Press, Montreal, Canada, pp 210–215.

    Google Scholar 

  21. Méger N, Rigotti C (2004) Constraint-based mining of episode rules and optimal window sizes. In: Boulicaut JF, Esposito F, Giannotti F, Pcdrcschi D (eds) Proceedings of the 8th european conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Pisa, Italy, pp 313–324.

  22. Mitchell T (1982) Generalization as search. AI Journal 18(2):203–226.

    Google Scholar 

  23. Narasimhan G, Bu C, Gao Y et al (2002) Mining protein sequences for motifs. J Comp Bio 9(5):707–720.

    Article  Google Scholar 

  24. Pei J, Han J, Mortazavi-Asl B et al (2001) Prefixspan: mining sequential patterns by prefix-projected growth. In: Proceedings of the 17th international conference on data engineering, IEEE Computer Society, Heidelberg, Germany, pp 215–224.

    Google Scholar 

  25. Raedit LD, Kramer S (2001) The levelwise version space algorithm and its application to molecular fragment finding. In: Nebel B (ed) Proceedings of the 17th international joint conference on artificial intelligence, Morgan Kaufmann, Seattle, Washington, USA, pp 853–862.

    Google Scholar 

  26. She R, Chen F, Wang K et al (2003) Frequent-subsequence-based prediction of outer membrane proteins. In: Gctoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 436–445.

    Chapter  Google Scholar 

  27. Tronícek Z (2001) Episode matching. In: Amir A, Landau GM (eds) Proceedings of the 12th annual symposium on combinatorial pattern matching, lecture notes in computer science, Jerusalem, Israel, pp 143–146.

  28. Tzvetkov P, Yan X, Han J (2003) TSP: mining top-K closed sequential patterns. In: Proceedings of the 3rd IEEE international conference on data mining, IEEE Computer Society, Melbourne, Florida, USA, pp 347–354.

    Chapter  Google Scholar 

  29. Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data engineering, IEEE Computer Society, Boston, MA, USA, pp 79–90.

    Chapter  Google Scholar 

  30. Webb GI, Butler S, Newlands D (2003) On detecting differences between groups. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 256–265.

    Chapter  Google Scholar 

  31. Yan X, Han J, Afshar R (2003) Clospan: mining closed sequential patterns in large databases. In: Barbara D, Kamath C (eds) Proceedings of the 3rd SIAM international conference on data mining, SIAM, San Francisco, CA, USA.

    Google Scholar 

  32. Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 2000 ACM CIKM international conference on information and knowledge management, ACM, McLean, VA, USA, pp 422–429.

    Chapter  Google Scholar 

  33. Zaki MJ (2001) Spade: an efficient algorithm for mining frequent sequences. Mach Learn 42(1/2):31–60.

    Article  MATH  Google Scholar 

  34. Zhang M, Kao B, Cheung D, Yip K (2005) Mining periodic patterns with gap requirement from sequences. In: Özcan F (ed) Proceedings of the 2005 ACM SIGMOD international conference on management of data, ACM, Maryland, USA, pp 622–633 .

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to James Bailey.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ji, X., Bailey, J. & Dong, G. Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst 11, 259–286 (2007). https://doi.org/10.1007/s10115-006-0038-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-006-0038-2

Keywords

Navigation