Data Mining and Knowledge Discovery, Volume 17, Issue 3, pp 457–495

An integrated, generic approach to pattern mining: data mining template library

  • Vineet Chaoji
  • Mohammad Al Hasan
  • Saeed Salem
  • Mohammed J. Zaki
Article

Abstract

Frequent pattern mining (FPM) is an important data mining paradigm for extracting informative patterns such as itemsets, sequences, trees, and graphs. However, no practical framework for integrating the FPM tasks has been attempted. In this paper, we describe the design and implementation of the Data Mining Template Library (DMTL) for FPM. DMTL takes a generic data mining approach and uses a novel pattern property hierarchy to define and mine different pattern types. This property hierarchy can be thought of as a systematic characterization of the pattern space, i.e., a meta-pattern specification that allows the analyst to specify new pattern types by extending the hierarchy. Furthermore, in DMTL all aspects of mining are controlled by a set of mining properties. For example, the mining approach to use, the data types and formats to mine over, and the back-end storage manager to use are all specified as a list of properties. This provides considerable flexibility to customize the toolkit for various applications. The flexibility of the toolkit is exemplified by the ease with which support for a new pattern type can be added. Experiments on synthetic and public datasets demonstrate the scalability provided by the persistent back-end in the library. DMTL has been publicly released as open-source software (http://dmtl.sourceforge.net/) and has been downloaded by numerous researchers from all over the world.
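As a rough illustration of what property-driven generic mining can look like in C++, the sketch below declares pattern types by composing property tags and uses them to select a support-counting routine at compile time. It is a minimal sketch, not DMTL's actual interface: the tag names (no_edges, directed, acyclic), the properties and pattern templates, and the support and is_frequent functions are all hypothetical names introduced here, and C++17 is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <type_traits>
#include <vector>

// Hypothetical property tags (illustrative only, not DMTL identifiers).
struct no_edges {};  // pattern has no explicit edges (itemset-like)
struct directed {};  // edges are directed
struct acyclic {};   // pattern contains no cycles

// A compile-time list of properties describing one pattern type.
template <typename... Props>
struct properties {};

// has_property<P, List>::value is true when tag P appears in List.
template <typename P, typename List>
struct has_property;

template <typename P, typename... Props>
struct has_property<P, properties<Props...>>
    : std::disjunction<std::is_same<P, Props>...> {};

// New pattern types are declared by composing property tags.
using itemset_props = properties<no_edges>;
using tree_props = properties<directed, acyclic>;

// A pattern carries its property list as a type parameter.
template <typename PropList>
struct pattern {
    using prop_list = PropList;
    std::set<int> items;  // item labels, kept sorted by std::set
};

using transaction = std::set<int>;

// Support = number of transactions that contain every item of the pattern.
// This overload is only valid for edge-free (itemset-like) patterns.
template <typename PropList>
std::size_t support(const pattern<PropList>& p,
                    const std::vector<transaction>& db) {
    static_assert(has_property<no_edges, PropList>::value,
                  "this sketch only counts support for edge-free patterns");
    return static_cast<std::size_t>(
        std::count_if(db.begin(), db.end(), [&](const transaction& t) {
            // std::includes requires sorted ranges; std::set guarantees that.
            return std::includes(t.begin(), t.end(),
                                 p.items.begin(), p.items.end());
        }));
}

// A pattern is frequent when its support reaches the minimum threshold.
template <typename PropList>
bool is_frequent(const pattern<PropList>& p,
                 const std::vector<transaction>& db,
                 std::size_t min_support) {
    assert(min_support >= 1);
    return support(p, db) >= min_support;
}
```

In this sketch, a pattern<itemset_props> can be passed to support, whereas a pattern<tree_props> is rejected at compile time, which mirrors the idea that the property list, rather than a hand-written class per pattern type, determines which mining operations apply.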

Keywords

Frequent pattern mining Itemset mining Sequence mining Tree mining Graph mining Generic programming 

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Vineet Chaoji (1)
  • Mohammad Al Hasan (1)
  • Saeed Salem (1)
  • Mohammed J. Zaki (1)

  1. Computer Science Department, Rensselaer Polytechnic Institute, Troy, USA
