Abstract
Sequential pattern mining is an important problem in data mining. State-of-the-art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. Explicitly creating database projections is thought to be a major computational bottleneck, but we show in this paper that it can be beneficial when an appropriate data structure is used. Our technique represents the sequence database as a canonical directed acyclic graph, which can in turn be represented as a binary decision diagram (BDD). We introduce a new type of BDD, the sequence BDD (SeqBDD), and show how it can be used to mine frequent subsequences efficiently. A novel feature of the SeqBDD is its ability to share results between similar intermediate computations and thereby avoid redundant computation. We perform an experimental study comparing the SeqBDD technique with existing pattern-growth techniques that are based on other data structures, such as prefix trees. Our results show that a SeqBDD can be half as large as a prefix tree, especially when many similar sequences exist. In terms of mining time, it can be substantially more efficient when the support is low, the number of patterns is large, or the input sequences are long and highly similar.
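To make the pattern-growth idea concrete, the following is a minimal sketch of frequent subsequence mining with explicitly projected databases, in the spirit of PrefixSpan. It is not the SeqBDD algorithm from this paper: it stores each projection as a plain list of suffixes and shares nothing between similar projections, which is the redundancy the abstract says a SeqBDD can avoid. The function name `mine_frequent_subsequences`, the single-item-per-position sequence model, and the absolute `min_support` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def mine_frequent_subsequences(database, min_support):
    """Return {pattern: support} for every subsequence that occurs in at
    least `min_support` sequences of `database`.

    Illustrative pattern-growth sketch with explicit projections;
    hypothetical helper, not the SeqBDD method.
    """
    results = {}

    def grow(prefix, projected):
        # Count, for each item, how many projected suffixes contain it.
        counts = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project: keep only what follows the first occurrence of
            # `item` in each suffix that contains it.
            next_projected = [
                s[s.index(item) + 1:] for s in projected if item in s
            ]
            grow(pattern, next_projected)

    grow((), [tuple(s) for s in database])
    return results
```

For the toy database `["abc", "abd", "acb"]` with `min_support=2`, this sketch reports five frequent subsequences: `a` (support 3), `b` (3), `c` (2), `ab` (3) and `ac` (2). Each recursive call materializes a new list of suffixes, so highly similar sequences produce many near-identical projections.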
Cite this article
Loekito, E., Bailey, J. & Pei, J. A binary decision diagram based approach for mining frequent subsequences. Knowl Inf Syst 24, 235–268 (2010). https://doi.org/10.1007/s10115-009-0252-9
DOI: https://doi.org/10.1007/s10115-009-0252-9