Computer Science - Research and Development, Volume 24, Issue 1–2, pp 51–68

Storing and indexing XML documents upside down

  • Christian Mathis
  • Theo Härder
  • Karsten Schmidt
Special Issue Paper

Abstract

XML documents contain substantial redundancy in their structural part, because each path from the root node to a leaf node is explicitly represented and typically large sets of such path instances belong to the same path class, i.e., the nodes of the path instances are labeled by the same sequence of element (or attribute) names. To save storage space and I/O cost, we want to eliminate this structural redundancy to the extent possible. While all known methods for the physical representation (storage) of XML documents proceed from the root via the element/attribute hierarchy (internal nodes) down to the leaves (values), we follow an upside-down approach which explicitly stores the values and reconstructs the internal nodes only when needed. The cornerstones of such a solution are suitable node labels and a path synopsis which efficiently represents all path classes of an XML document. As a solution, we propose a compact internal storage format for native XML database systems in which the inner structure of the stored documents is virtualized. Because this elementless storage format allows a document to be reconstructed efficiently from its path synopsis, all processing properties are preserved and the semantics of navigational and declarative operations of XML languages remain unchanged. Adjusted indexes support the full spectrum of so-called content-and-structure single path queries. Apart from greatly reduced storage consumption, our approach demonstrates its superiority, compared to competing methods, not only for a substantial fraction of those queries, but also for storing, reconstructing, and navigating XML documents.
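
The idea of storing only values and virtualizing the inner structure can be illustrated with a minimal sketch. It is not the storage format of the paper; it only assumes that each stored value carries a prefix-based node label with one component per tree level and a reference to its path class in the path synopsis. All names (PATH_SYNOPSIS, StoredValue, ancestors, rebuild_inner_nodes) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Label = Tuple[int, ...]  # assumed prefix-based label, one component per level

# Hypothetical path synopsis: path-class id -> element names from root to leaf.
PATH_SYNOPSIS: Dict[int, Tuple[str, ...]] = {
    1: ("bib", "book", "title"),
    2: ("bib", "book", "author", "last"),
}


@dataclass(frozen=True)
class StoredValue:
    """A leaf entry: only the value, its label, and its path class are stored."""
    label: Label
    path_class: int
    value: str


def ancestors(entry: StoredValue,
              synopsis: Dict[int, Tuple[str, ...]] = PATH_SYNOPSIS
              ) -> List[Tuple[Label, str]]:
    """Reconstruct (label, element name) for every node on the path to the value.

    Each prefix of the leaf label is the label of one ancestor; the element
    name at that depth comes from the path class in the synopsis.
    """
    names = synopsis[entry.path_class]
    if len(names) != len(entry.label):
        raise ValueError("label depth does not match path class depth")
    return [(entry.label[:depth], names[depth - 1])
            for depth in range(1, len(names) + 1)]


def rebuild_inner_nodes(entries: Iterable[StoredValue],
                        synopsis: Dict[int, Tuple[str, ...]] = PATH_SYNOPSIS
                        ) -> Dict[Label, str]:
    """Collect all virtual nodes implied by a set of stored values."""
    nodes: Dict[Label, str] = {}
    for entry in entries:
        for label, name in ancestors(entry, synopsis):
            nodes.setdefault(label, name)  # shared ancestors appear only once
    return nodes


title = StoredValue((1, 1, 1), 1, "TCP/IP Illustrated")
last = StoredValue((1, 1, 2, 1), 2, "Stevens")
print(ancestors(title))
# → [((1,), 'bib'), ((1, 1), 'book'), ((1, 1, 1), 'title')]
print(len(rebuild_inner_nodes([title, last])))
# → 5
```

In this sketch the element names are never stored per node: two values under the same book share the labels (1,) and (1, 1), and their names are looked up once per path class, which is where the structural redundancy described above disappears.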

Keywords

Storage formats, XML indexes, Native XML database management systems, Elementless XML storage, Path synopsis, Prefix-based node labeling

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Christian Mathis (1)
  • Theo Härder (1)
  • Karsten Schmidt (1)
  1. Dept. of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany
