Knowledge and Information Systems, Volume 8, Issue 2, pp 203–234

Canonical forms for labelled trees and their applications in frequent subtree mining

Article

Abstract

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, and computer networks. In this paper, we first present two canonical forms for labelled rooted unordered trees: the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). We then apply these canonical forms to the frequent subtree mining problem. Based on the BFCF, we develop a vertical mining algorithm, RootedTreeMiner, to discover all frequently occurring subtrees in a database of labelled rooted unordered trees. RootedTreeMiner uses an enumeration tree to enumerate all (frequent) labelled rooted unordered subtrees. Next, we extend the definition of the DFCF to labelled free trees and present an Apriori-like algorithm, FreeTreeMiner, to discover all frequently occurring subtrees in a database of labelled free trees. Finally, we study the performance and scalability of our algorithms through extensive experiments on both synthetic data and datasets from real applications.
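
To illustrate why a canonical form is useful for unordered trees, the following is a minimal sketch of one generic way to build an order-independent encoding: encode each subtree recursively and sort the sibling encodings. It is not the BFCF or DFCF defined in the paper, whose exact definitions are not given in this abstract. The names `Tree`, `canonical_string`, and `is_isomorphic` are hypothetical, and labels are assumed not to contain parentheses.

```python
from typing import List, Tuple

# Assumed representation: a tree is a (label, children) pair.
# The order of the children list carries no meaning (unordered tree).
Tree = Tuple[str, List["Tree"]]


def canonical_string(tree: Tree) -> str:
    """Return an encoding that is identical for isomorphic unordered trees.

    Illustrative sorted-encoding scheme only; not the paper's BFCF or DFCF.
    Assumes labels contain no '(' or ')' characters.
    """
    label, children = tree
    # Encode every subtree first, then sort so sibling order no longer matters.
    encoded = sorted(canonical_string(child) for child in children)
    return "(" + label + "".join(encoded) + ")"


def is_isomorphic(a: Tree, b: Tree) -> bool:
    """Two labelled rooted unordered trees match iff their encodings match."""
    return canonical_string(a) == canonical_string(b)


# t1 and t2 differ only in sibling order; t3 has a different shape.
t1: Tree = ("A", [("B", [("D", [])]), ("C", [])])
t2: Tree = ("A", [("C", []), ("B", [("D", [])])])
t3: Tree = ("A", [("B", []), ("C", [("D", [])])])
```

With such an encoding, a mining algorithm can count a candidate subtree once regardless of how its siblings happen to be ordered in each database tree, because every ordering maps to the same string.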

Keywords

Canonical form; Frequent subtree; Labelled free tree; Labelled rooted unordered tree; Tree isomorphism

Copyright information

© Springer-Verlag 2004

Authors and Affiliations

  1. Department of Computer Science, University of California, Los Angeles, USA
