# XML Compression via Directed Acyclic Graphs

## Abstract

Unranked node-labeled trees can be represented by their minimal dag (directed acyclic graph). For XML documents this achieves high compression ratios because their markup is repetitive. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (number of edges) between the minimal dag of an unranked tree and the minimal dag of its fcns-encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the *hybrid dag*, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions and in terms of their asymptotic behavior.
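To make the two size measures concrete, the following is a minimal Python sketch, not the authors' implementation. It shares identical subtrees by hash-consing and counts edges of the resulting dag, once for an unranked tree and once for its fcns encoding. The function names (`dag_size`, `fcns`, `binary_dag_size`), the tuple representation of trees, and the convention that a missing first child or next sibling is `None` and contributes no edge are all assumptions made for illustration.

```python
def dag_size(tree):
    """Edges in the minimal dag of an unranked tree given as (label, [children])."""
    ids = {}    # (label, child ids) -> id of the shared dag node
    edges = 0

    def visit(node):
        nonlocal edges
        label, children = node
        key = (label, tuple(visit(child) for child in children))
        if key not in ids:          # first occurrence of this subtree
            ids[key] = len(ids)
            edges += len(children)
        return ids[key]

    visit(tree)
    return edges


def fcns(siblings):
    """First child/next sibling encoding of a sibling list.

    Returns (label, first_child, next_sibling), or None for an empty list.
    """
    if not siblings:
        return None
    label, children = siblings[0]
    return (label, fcns(children), fcns(siblings[1:]))


def binary_dag_size(btree):
    """Edges in the minimal dag of a binary tree; None children add no edge (assumption)."""
    ids = {}
    edges = 0

    def visit(node):
        nonlocal edges
        if node is None:
            return None
        label, left, right = node
        key = (label, visit(left), visit(right))
        if key not in ids:
            ids[key] = len(ids)
            edges += (left is not None) + (right is not None)
        return ids[key]

    visit(btree)
    return edges


# Hypothetical example trees.
a, b, c = ("a", []), ("b", []), ("c", [])
t1 = ("f", [("g", [a]), ("g", [a]), ("g", [a])])   # repeated whole subtrees
t2 = ("f", [("g", [a, b, c]), ("h", [b, c])])      # shared sibling suffix b, c

print(dag_size(t1), binary_dag_size(fcns([t1])))
print(dag_size(t2), binary_dag_size(fcns([t2])))
```

On `t1` the unranked dag is the smaller one, because the three identical `g(a)` subtrees collapse to a single node while their fcns encodings differ in the next-sibling part. On `t2` the binary dag is the smaller one, because the sibling suffix `b, c` is shared in the fcns encoding. Neither representation dominates the other in general, which is the situation the hybrid dag is meant to address.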

## Keywords

XML, Tree compression, Directed acyclic graph

## Notes

### Acknowledgements

The first author acknowledges the hospitality of the Institute of Computer Science, Universität Leipzig, where this work was carried out. The second and fourth authors were supported by the DFG grant LO 748/8. The third author was supported by the DFG grant INST 268/239 and by the Engineering and Physical Sciences Research Council project “Enforcement of Constraints on XML streams” (EPSRC EP/G004021/1).
