Theory of Computing Systems, Volume 57, Issue 4, pp 1322–1371

XML Compression via Directed Acyclic Graphs

  • Mireille Bousquet-Mélou
  • Markus Lohrey
  • Sebastian Maneth
  • Eric Noeth


Unranked node-labeled trees can be represented by their minimal dag (directed acyclic graph). For XML this achieves high compression ratios because XML markup is highly repetitive. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) between the minimal dag of an unranked tree and the minimal dag of its fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be no larger than either of the two dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under the uniform distribution) in terms of their exact generating functions and in terms of their asymptotic behavior.
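To make the two size measures concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical tuple representation `(label, (child, ...))` for unranked trees and `(label, left, right)` for fcns encoded binary trees, and it assumes that edges to absent (nil) children of the binary tree are not counted. Both dags are built by hash-consing: identical subtrees receive the same identifier, and the outgoing edges of each distinct subtree are counted once.

```python
# Hypothetical representation: an unranked tree is (label, (child, child, ...)).

def dag_edges(root):
    """Count the edges of the minimal dag of an unranked ordered tree."""
    ids = {}    # (label, tuple of child ids) -> id of the shared dag node
    edges = 0

    def visit(node):
        nonlocal edges
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:              # first occurrence of this subtree
            ids[key] = len(ids)
            edges += len(children)      # one edge per child, with multiplicity
        return ids[key]

    visit(root)
    return edges


def fcns(forest):
    """Encode a sequence of sibling trees as a binary tree (label, left, right).

    left = first child, right = next sibling; None marks an absent child.
    """
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, fcns(children), fcns(rest))


def binary_dag_edges(root):
    """Count the edges of the minimal dag of a binary tree (assumption: nil edges are not counted)."""
    ids = {}
    edges = 0

    def visit(node):
        nonlocal edges
        if node is None:
            return None
        label, left, right = node
        key = (label, visit(left), visit(right))
        if key not in ids:
            ids[key] = len(ids)
            edges += (left is not None) + (right is not None)
        return ids[key]

    visit(root)
    return edges


a, b, c = ("a", ()), ("b", ()), ("c", ())

# Repeated whole subtrees favor the unranked dag.
t1 = ("r", (("f", (a, b)), ("f", (a, b))))
# Shared suffixes of child sequences favor the fcns dag.
t2 = ("r", (("f", (b, a, a)), ("g", (c, a, a))))

for t in (t1, t2):
    print(dag_edges(t), binary_dag_edges(fcns((t,))))
```

Under these assumptions the first tree gives 4 edges for the unranked dag and 5 for the binary dag, while the second gives 8 and 7, so neither representation is always the smaller one on these toy inputs.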


Keywords: XML, Tree compression, Directed acyclic graph



The first author acknowledges the hospitality of the Institute of Computer Science, Universität Leipzig, where this work was carried out. The second and fourth authors were supported by the DFG grant LO 748/8. The third author was supported by the DFG grant INST 268/239 and by the Engineering and Physical Sciences Research Council project “Enforcement of Constraints on XML streams” (EPSRC EP/G004021/1).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Mireille Bousquet-Mélou (1)
  • Markus Lohrey (2)
  • Sebastian Maneth (3)
  • Eric Noeth (2)

  1. CNRS, LaBRI, Université de Bordeaux, Bordeaux, France
  2. University of Siegen, Siegen, Germany
  3. University of Edinburgh, Edinburgh, Scotland
