Abstract
Mining frequent tree patterns has many applications in areas such as XML data, bioinformatics and the World Wide Web. The crucial step in frequent pattern mining is frequency counting, which uses a matching operator to find the occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data structure, called occ, which stores only information about the rightmost paths of occurrences, so that a single occ can represent several occurrences of a tree pattern. We then define efficient join operations on the occ data structure, which let us count the occurrences of a tree pattern from the occurrences of its proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and that in several cases the improvement over existing algorithms is significant.
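To make the matching operator concrete, the following is a minimal Python sketch of occurrence counting under subtree homeomorphism for rooted ordered trees. It is a naive enumeration for illustration only, not the occ data structure or TPMiner. It assumes a common formulation in which each pattern edge maps to a proper ancestor-descendant pair and sibling images appear left to right in disjoint subtrees; the formal definition in the paper may differ in details. The tree encoding as nested `(label, children)` tuples and the names `index_tree` and `count_embeddings` are hypothetical.

```python
def index_tree(tree):
    """Number the vertices in preorder.

    Returns (labels, scope), where labels[i] is the label of the vertex with
    preorder number i and scope[i] is the preorder number of its last
    descendant (i itself for a leaf). Vertex a is a proper ancestor of
    vertex d exactly when a < d <= scope[a].
    """
    labels, scope = [], []

    def visit(node):
        label, children = node
        i = len(labels)
        labels.append(label)
        scope.append(i)
        for child in children:
            visit(child)
        scope[i] = len(labels) - 1

    visit(tree)
    return labels, scope


def count_embeddings(pattern, tree):
    """Count occurrences of `pattern` in `tree` (illustrative assumption:
    edges map to ancestor-descendant pairs, siblings keep their order and
    map into disjoint subtrees)."""
    labels, scope = index_tree(tree)

    def match_at(p, v):
        # Ways to embed pattern subtree p with its root mapped to vertex v.
        p_label, p_children = p
        if labels[v] != p_label:
            return 0
        # Children of p may map to any proper descendant of v.
        return match_forest(p_children, v + 1, scope[v])

    def match_forest(ps, lo, hi):
        # Ways to embed the ordered subtrees ps into vertices lo..hi.
        if not ps:
            return 1
        total = 0
        for v in range(lo, hi + 1):
            ways = match_at(ps[0], v)
            if ways:
                # Later siblings must lie to the right of v's whole subtree.
                total += ways * match_forest(ps[1:], scope[v] + 1, hi)
        return total

    return sum(match_at(pattern, v) for v in range(len(labels)))


# Data tree: a(b(c), c). Pattern a(c) has one parent-child occurrence, but
# two occurrences once an edge may map to an ancestor-descendant pair.
data_tree = ("a", [("b", [("c", [])]), ("c", [])])
print(count_embeddings(("a", [("c", [])]), data_tree))  # 2
```

The enumeration revisits the same subproblems many times, which is the kind of cost a compact representation of occurrences is meant to avoid.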
Notes
The upper bound of the scope of the last vertex is already available in scope; for convenience of presentation, the information is duplicated in RP.
Acknowledgments
We are grateful to Professor Mohammed Javeed Zaki for providing the VTreeMiner code, the CSLOGS datasets and the TreeGenerator program, to Dr Henry Tan for providing the MB3Miner code, to Dr Fedja Hadzic for providing the Prions dataset and to Professor Jun-Hong Cui for providing the NASA dataset. Finally, we would like to thank Dr Morteza Haghir Chehreghani for his discussion and suggestions.
Additional information
Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge and Concha Bielza.
Cite this article
Haghir Chehreghani, M., Bruynooghe, M. Mining rooted ordered trees under subtree homeomorphism. Data Min Knowl Disc 30, 1249–1272 (2016). https://doi.org/10.1007/s10618-015-0439-5
DOI: https://doi.org/10.1007/s10618-015-0439-5