Data Mining and Knowledge Discovery

, Volume 18, Issue 3, pp 472–516 | Cite as

Incremental sequence-based frequent query pattern mining from XML queries

  • Guoliang Li
  • Jianhua Feng
  • Jianyong Wang
  • Lizhu Zhou
Article

Abstract

Existing algorithms of mining frequent XML query patterns (XQPs) employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods compute the frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XQPs transferred from user query logs. However, it is not straightforward to maintain such discovered frequent patterns in real XML databases as there may be frequent updates that may not only invalidate some existing frequent query patterns but also generate some new frequent query patterns. Therefore, a drawback of existing methods is that they are rather inefficient for the evolution of transaction databases. To address above-mentioned problems, this paper proposes an efficient algorithm ESPRIT to mine frequent XQPs without costly tree-containment checking. ESPRIT transforms XML queries into sequences using a one-to-one mapping technique and mines the frequent sequences to generate frequent XQPs. We propose two efficient incremental algorithms, ESPRIT-i and ESPRIT-i+, to incrementally mine frequent XQPs. We devise several novel optimization techniques of query rewriting, cache lookup, and cache replacement to improve the answerability and the hit rate of caching. We have implemented our algorithms and conducted a set of experimental studies on various datasets. The experimental results demonstrate that our algorithms achieve high efficiency and scalability and outperform state-of-the-art methods significantly.

Keywords

XML query patterns Frequent query patterns XML frequent pattern mining Incremental mining Sequential pattern mining 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499Google Scholar
  2. Aggarwal C, Ta N, Wang J, Feng J, Zaki MJ (2007) Xproj: a framework for projected structural clustering of xml documents. In: KDDGoogle Scholar
  3. Asai T, Abe K, Kawasoe S, Arimura H, Sakamoto H, Arikawa S (2002) Efficient substructure discovery from large semi-structured data. In: SDMGoogle Scholar
  4. Aumann Y, Feldman R, Liphstat O, Mannila H (1999) Borders: an efficient algorithm for association generation in dynamic databases. J Intell Inf Syst 12(1): 61–73CrossRefGoogle Scholar
  5. Ayres J, Flannick J, Gehrke J, Yiu T (2002) Sequential pattern mining using a bitmap representation. In: KDDGoogle Scholar
  6. Balmin A, Ozcan F, Beyer K, Cochrane R, Pirahesh H (2004) A framework for using materialized xpath views in xml query processing. In: VLDB, pp 60–71Google Scholar
  7. Bettini C, Wang XS, Jajodia S (1998) Mining temporal relationships with multiple granularities in time sequences. IEEE Data Eng Bull 21(1): 32–38MathSciNetGoogle Scholar
  8. Chen L, Rundensteiner EA, Wang S (2002) Xcache: a semantic caching system for xml queries. In: SIGMODGoogle Scholar
  9. Chen Y, Yang L, Wang YG (2004) Incremental mining of frequent xml query patterns. In: ICDM, pp 343–346Google Scholar
  10. Chung C-W, Min J-K, Shim K (2002) Apex: an adaptive path index for xml data. In: SIGMOD Conference, pp 121–132Google Scholar
  11. Dehaspe L, Toivonen H, King R (1998) Finding frequent substructures in chemical compounds. In: KDD, pp 30–36Google Scholar
  12. Feng J, Qian Q, Wang J, Zhou L (2006) Exploit sequencing to accelerate hot xml query pattern mining. In: ACM SACGoogle Scholar
  13. Feng J, Ta N, Li G (2007) Exploit sequencing views in semantic cache to accelerate xpath query evaluation. In: WWWGoogle Scholar
  14. Han J, Dong G, Yin Y (1999) Efficient mining of partial periodic patterns in time series database. In: ICDE, pp 106–115Google Scholar
  15. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu M (2000) Freespan: frequent pattern-projected sequential pattern mining. In: KDD, pp 355–359Google Scholar
  16. Hristidis V, Petropoulos M (2002) Semantic caching of xml databases. In: WebDBGoogle Scholar
  17. Kaushik R, Shenoy P, Bohannon P, Gudes E (2002) Exploiting local similarity for indexing paths in graph-structured data. In: ICDE, pp 129–140Google Scholar
  18. Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: ICDM, pp 313–320Google Scholar
  19. Kwon J, Rao P, Moon B, Lee S (2005) Fist: scalable xml document filtering by sequencing twig patterns. In: VLDB, pp 217–228Google Scholar
  20. Li G, Feng J, Ta N, Zhang Y, Zhou L (2006a) Scend: an efficient semantic cache to exploit xpath query/view answerability. In: WISE, pp 460–473Google Scholar
  21. Li G, Feng J, Wang J, Zhang Y, Zhou L (2006b) Incremental mining of frequent query patterns from xml queries for caching. In: ICDM, pp 350–361Google Scholar
  22. Luo Q, Krishnamurthy S, Mohan C, Pirahesh H, Woo H, Lindsay BG, Naughton JF (2002) Middle-tier database caching for e-business. In: SIGMOD, pp 600–611Google Scholar
  23. Mandhani B, Suciu D (2005) Query caching and view selection for xml databases. In: VLDBGoogle Scholar
  24. Masseglia F, Cathala F, Poncelet P (1998) The psp approach for mining sequential patterns. In: PKDDGoogle Scholar
  25. Milo T, Suciu D (1999) Index structures for path expressions. In: ICDT, pp 277–295Google Scholar
  26. Ozden B, Ramaswamy S, Silberschatz A (1998) Cyclic association rules. In: ICDE, pp 412–421Google Scholar
  27. Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M (2001) Prefixspan: Mining sequential patterns by prefix-projected growth. In: ICDE, pp 215–224Google Scholar
  28. Prüfer H (1918) Neuer beweis eines satzes uber permutationen. Archiv fur Mathematik und Physik 27: 142–144Google Scholar
  29. Qun C, Lim A, Ong KW (2003) D(k)-index: an adaptive structural summary for graph-structured data. In: SIGMOD, pp 134–144Google Scholar
  30. Rao PR, Moon B (2004) Prix: indexing and querying xml using prufer sequences. In: ICDE, pp 288–299Google Scholar
  31. Re C, Brinkley J, Hinshaw K, Suciu D (2004) Distributed xquery. In: Information integration on the web (IIWeb)Google Scholar
  32. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: EDBT, pp 3–17Google Scholar
  33. Termier A, Rousset M-C, Sebag M (2002) Treefinder: a first step towards xml data mining. In: ICDM, pp 450–457Google Scholar
  34. Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: ICDE, pp 79–90Google Scholar
  35. Wang K, Liu H (2000) Discovering structural association of semistructured data. IEEE TKDE 12(2): 353–371Google Scholar
  36. Wang H, Park S, Fan W, Yu PS (2003) Vist: a dynamic index method for querying xml data by tree structures. In: SIGMOD, pp 110–121Google Scholar
  37. Xu W (2005) The framework of an xml semantic caching system. In: WebDBGoogle Scholar
  38. Yan X, Han J, Afshar R (2003) Clospan: mining closed sequential patterns in large databases. In: SDMGoogle Scholar
  39. Yang J, Wang W, Yu PS, Han J (2002) Mining long sequential patterns in a noisy environment. In: SIGMOD, pp 406–417Google Scholar
  40. Yang LH, Lee M-L, Hsu W (2003a) Efficient mining of xml query patterns for caching. In: VLDB, pp 69–80Google Scholar
  41. Yang LH, Lee M-L, Hsu W, Acharya S (2003b) Mining frequent query patterns from xml queries. In: DASFAA, pp 355–362Google Scholar
  42. Yang LH, Lee ML, Hsu W, Guo X (2004) 2pxminer: an efficient two pass mining of frequent xml query patterns. In: KDDGoogle Scholar
  43. Zaki MJ (2002) Efficiently mining frequent trees in a forest. In: SIGKDD, pp 71–80Google Scholar
  44. Zaki MJ (2005) Efficiently mining frequent trees in a forest: algorithms and applications. IEEE TKDE 17(8): 1021–1035Google Scholar

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Guoliang Li
    • 1
  • Jianhua Feng
    • 1
  • Jianyong Wang
    • 1
  • Lizhu Zhou
    • 1
  1. 1.Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and TechnologyTsinghua UniversityBeijingChina

Personalised recommendations