A binary decision diagram based approach for mining frequent subsequences

Loekito, Elsa; Bailey, James; Pei, Jian

doi:10.1007/s10115-009-0252-9

Regular Paper
Published: 17 September 2009

Knowledge and Information Systems, Volume 24, pages 235–268 (2010)

Elsa Loekito¹,
James Bailey¹ &
Jian Pei²

Abstract

Sequential pattern mining is an important problem in data mining. State-of-the-art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. Explicitly creating database projections is thought to be a major computational bottleneck, but we show in this paper that it can be beneficial when an appropriate data structure is used. Our technique represents the sequence database as a canonical directed acyclic graph, which can in turn be represented as a binary decision diagram (BDD). We introduce a new type of BDD, the sequence BDD (SeqBDD), and show how it can be used to mine frequent subsequences efficiently. A novel feature of the SeqBDD is its ability to share results between similar intermediate computations and so avoid redundant computation. We perform an experimental study comparing the SeqBDD technique with existing pattern-growth techniques that are based on other data structures, such as prefix trees. Our results show that a SeqBDD can be half as large as a prefix tree, especially when many similar sequences exist. In terms of mining time, it can be substantially more efficient when the support threshold is low, the number of patterns is large, or the input sequences are long and highly similar.
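
As a rough illustration of the pattern-growth idea the abstract refers to, the sketch below mines frequent subsequences by recursively projecting the database onto the suffixes that follow each frequent item. It is a generic, PrefixSpan-style toy and not the SeqBDD algorithm, whose details are not given on this page. The function name `frequent_subsequences`, the use of strings as sequences of single-character items, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def frequent_subsequences(sequences, min_support):
    """Return {pattern: support} for every non-empty subsequence that
    occurs in at least `min_support` of the input sequences.

    Hypothetical helper for illustration only: sequences are strings
    and each character is one item.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds one (sequence id, start offset) pair per
        # sequence that contains `prefix`; the offset marks the suffix
        # left over after the earliest match of `prefix`.
        first_pos = defaultdict(dict)  # item -> {sequence id: earliest position}
        for sid, start in projected:
            seq = sequences[sid]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(sid, pos)

        for item, occurrences in sorted(first_pos.items()):
            support = len(occurrences)  # number of distinct sequences
            if support >= min_support:
                pattern = prefix + item
                results[pattern] = support
                # Project onto the suffixes after the earliest match
                # and keep growing the pattern.
                grow(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    grow("", [(sid, 0) for sid in range(len(sequences))])
    return results


if __name__ == "__main__":
    database = ["abc", "acb", "ab"]
    for pattern, support in sorted(frequent_subsequences(database, 2).items()):
        print(pattern, support)
```

Each recursive call builds a new projected database explicitly, which is the step the abstract describes as a commonly assumed bottleneck. Two different prefixes can lead to identical projections, and this toy recomputes them independently; sharing such repeated work is the kind of saving the abstract attributes to the SeqBDD.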

Author information

Authors and Affiliations

National ICT Australia (NICTA), Department of Computer Science and Software Engineering, University of Melbourne, Melbourne, VIC, Australia
Elsa Loekito & James Bailey
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Jian Pei

Corresponding author

Correspondence to James Bailey.

About this article

Cite this article

Loekito, E., Bailey, J. & Pei, J. A binary decision diagram based approach for mining frequent subsequences. Knowl Inf Syst 24, 235–268 (2010). https://doi.org/10.1007/s10115-009-0252-9

Received: 03 April 2008
Revised: 05 April 2009
Accepted: 15 August 2009
Published: 17 September 2009
Issue Date: August 2010
DOI: https://doi.org/10.1007/s10115-009-0252-9
