Advertisement

Knowledge and Information Systems

, Volume 11, Issue 3, pp 259–286 | Cite as

Mining minimal distinguishing subsequence patterns with gap constraints

  • Xiaonan Ji
  • James Bailey
  • Guozhu Dong
Regular Paper

Abstract

Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.

Keywords

Data mining algorithm Sequential pattern Frequent pattern Emerging pattern Gap constraint Contrast pattern 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Agrawel R, Srikant R (1995) Mining sequential patterns. In: Yu PS, Chen ALP (eds) Proceedings of the 11th international conference on data engineering, IEEE computer society, Taipei, Taiwan, pp 3–14.Google Scholar
  2. 2.
    Antunes C, Oliveira AL (2003) Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In: Perner P, Rosenfeld A (eds) Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition, lecture notes in computer science, Leipzig, Germany, pp 239–251.Google Scholar
  3. 3.
    Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SlGKDD international conference on knowledge discovery and data mining, ACM, Edmonton, Alberta, Canada, pp 429–435.Google Scholar
  4. 4.
    Bailey J, Manoukian T, Ramamohanarao K (2003) Classification using constrained emerging patterns. In: Dong G, Tang C, Wang W (eds) Proceedings of the 4th international conference on advances in web-age information management, lecture notes in computer science, Chengdu, China, pp 226–237.Google Scholar
  5. 5.
    Bay SD, Pazzani MJ (2001) Detecting group differences: Mining contrast sets. Data Mining Know Disc 5:213–246.zbMATHCrossRefGoogle Scholar
  6. 6.
    Garriga GC (2003) Discovering unbounded episodes in sequential data. In: Lavrac N, Gamberger D, Blockeel H, Todorovski L (eds) Proceedings of the 7th European conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Cavtat-Dubrovnik, Croatia, pp 83–94.Google Scholar
  7. 7.
    Chan S, Kao B, Yip CL et al (2003) Mining emerging substrings. In: Proceedings of the 8th international conference on database systems for advanced applications, IEEE Computer Society, Kyoto, Japan, pp 119–126.Google Scholar
  8. 8.
    Das G, Fleischer R, Gasieniec L et al (1997) Episode matching. In: Apostolico A, Hein J (eds) Proceedings of the 8th annual symposium on combinatorial pattern matching, lecture notes in computer science, Aarhus, Denmark, pp 12–27.Google Scholar
  9. 9.
    Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, San Diego, CA, USA, pp 43–52.CrossRefGoogle Scholar
  10. 10.
    Dong G, Li J (2005) Mining border descriptions of emerging patterns from dataset pairs. Know and Inf Syst 8:178–202.CrossRefGoogle Scholar
  11. 11.
    Dong G, Zhang X, Wong L et al (1999) CAEP: Classification by aggregating emerging patterns. In: Arikawa S, Furukawa K (eds) Proceedings of the 2nd international conference on discovery science, lecture notes in computer science, Tokyo, Japan, pp 30–42.Google Scholar
  12. 12.
    Fischer J, Raedt LD (2004) Towards optimizing conjunctive inductive queries. In: Dai H, Srikant R, Zhang C (eds) Proceedings of the 8th Pacific-Asia conference on advances in knowledge discovery and data mining, lecture notes in computer science, Sydney, Australia, pp 625–637.Google Scholar
  13. 13.
    Gusfield D (1997) Algorithms on strings, trees and sequences, computer science and computational biology. Cambridge University Press, Cambridge, pp 235–244.zbMATHGoogle Scholar
  14. 14.
    Han J, Pei J, Mortazavi-Asl B et al (2000) Freespan: frequent pattern-projected sequential pattern mining. In: proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Boston, MA, USA, pp 355–359.CrossRefGoogle Scholar
  15. 15.
    Hirao M, Hoshino H, Shinohara A et al (2003) A practical algorithm to find the best subsequence patterns. Theor Comp Sci 292:465–479.zbMATHCrossRefMathSciNetGoogle Scholar
  16. 16.
    Ji X, Bailey J, Dong D (2005) Mining minimal distinguishing subsequence patterns with gap constraints. In: Proceedings of the 5th IEEE international conference on data mining, IEEE Computer Society, Houston, Texas USA, pp 194–201.Google Scholar
  17. 17.
    Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Int Syst 15:48–56.CrossRefGoogle Scholar
  18. 18.
    Li J, Dong G, Ramamohanarao K (2001) Making use of the most expressive jumping emerging patterns for classification. Know Info Sys 3:131–145.CrossRefGoogle Scholar
  19. 19.
    Maier D (1978) The complexity of some problems on subsequences and supersequences. J ACM 25(2):322–336.zbMATHCrossRefMathSciNetGoogle Scholar
  20. 20.
    Mannila H, Toivonen H, Verkamo AI (1995) Discovering frequent episodes in sequences. In: Fayyad UM, Uthurusamy R (eds) Proceedings of the 1st international conference on knowledge discovery and data mining, AAAI Press, Montreal, Canada, pp 210–215.Google Scholar
  21. 21.
    Méger N, Rigotti C (2004) Constraint-based mining of episode rules and optimal window sizes. In: Boulicaut JF, Esposito F, Giannotti F, Pcdrcschi D (eds) Proceedings of the 8th european conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Pisa, Italy, pp 313–324.Google Scholar
  22. 22.
    Mitchell T (1982) Generalization as search. AI Journal 18(2):203–226.Google Scholar
  23. 23.
    Narasimhan G, Bu C, Gao Y et al (2002) Mining protein sequences for motifs. J Comp Bio 9(5):707–720.CrossRefGoogle Scholar
  24. 24.
    Pei J, Han J, Mortazavi-Asl B et al (2001) Prefixspan: mining sequential patterns by prefix-projected growth. In: Proceedings of the 17th international conference on data engineering, IEEE Computer Society, Heidelberg, Germany, pp 215–224.Google Scholar
  25. 25.
    Raedit LD, Kramer S (2001) The levelwise version space algorithm and its application to molecular fragment finding. In: Nebel B (ed) Proceedings of the 17th international joint conference on artificial intelligence, Morgan Kaufmann, Seattle, Washington, USA, pp 853–862.Google Scholar
  26. 26.
    She R, Chen F, Wang K et al (2003) Frequent-subsequence-based prediction of outer membrane proteins. In: Gctoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 436–445.CrossRefGoogle Scholar
  27. 27.
    Tronícek Z (2001) Episode matching. In: Amir A, Landau GM (eds) Proceedings of the 12th annual symposium on combinatorial pattern matching, lecture notes in computer science, Jerusalem, Israel, pp 143–146.Google Scholar
  28. 28.
    Tzvetkov P, Yan X, Han J (2003) TSP: mining top-K closed sequential patterns. In: Proceedings of the 3rd IEEE international conference on data mining, IEEE Computer Society, Melbourne, Florida, USA, pp 347–354.CrossRefGoogle Scholar
  29. 29.
    Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data engineering, IEEE Computer Society, Boston, MA, USA, pp 79–90.CrossRefGoogle Scholar
  30. 30.
    Webb GI, Butler S, Newlands D (2003) On detecting differences between groups. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 256–265.CrossRefGoogle Scholar
  31. 31.
    Yan X, Han J, Afshar R (2003) Clospan: mining closed sequential patterns in large databases. In: Barbara D, Kamath C (eds) Proceedings of the 3rd SIAM international conference on data mining, SIAM, San Francisco, CA, USA.Google Scholar
  32. 32.
    Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 2000 ACM CIKM international conference on information and knowledge management, ACM, McLean, VA, USA, pp 422–429.CrossRefGoogle Scholar
  33. 33.
    Zaki MJ (2001) Spade: an efficient algorithm for mining frequent sequences. Mach Learn 42(1/2):31–60.zbMATHCrossRefGoogle Scholar
  34. 34.
    Zhang M, Kao B, Cheung D, Yip K (2005) Mining periodic patterns with gap requirement from sequences. In: Özcan F (ed) Proceedings of the 2005 ACM SIGMOD international conference on management of data, ACM, Maryland, USA, pp 622–633 .Google Scholar

Copyright information

© Springer-Verlag London Limited 2006

Authors and Affiliations

  1. 1.Department of Computer Science and Software Engineering, NICTA Victoria LaboratoryUniversity of MelbourneMelbourneAustralia
  2. 2.Department of Computer Science and EngineeringWright State UniversityDaytonUSA

Personalised recommendations