Theory of Computing Systems, Volume 52, Issue 3, pp 542–585

Generating, Sampling and Counting Subclasses of Regular Tree Languages

  • Timos Antonopoulos
  • Floris Geerts
  • Wim Martens
  • Frank Neven


To experimentally validate learning and approximation algorithms for XML Schema Definitions (XSDs), we need algorithms that generate a corpus of XSDs uniformly at random, as well as a similarity measure that quantifies how closely a generated XSD resembles the target schema. In this paper, we provide the formal foundation for such a testbed. We adopt similarity measures based on counting the number of common and different trees in the two languages, and we develop the necessary machinery for computing them. We use the formalism of extended DTDs (EDTDs) to represent the unranked regular tree languages. In particular, we obtain an efficient algorithm to count the number of trees up to a certain size in the language of an unambiguous EDTD. The class of unambiguous EDTDs encompasses the more familiar classes of single-type, restrained competition and bottom-up deterministic EDTDs. The single-type EDTDs correspond precisely to the core of XML Schema, while the others are strictly more expressive. We also show how constraints on the shape of allowed trees can be incorporated. As we make use of a translation into a well-known formalism for combinatorial specifications, we get for free a sampling procedure to draw members of the language of any unambiguous EDTD. When dropping the restriction to unambiguous EDTDs, i.e., taking the full class of EDTDs into account, we show that the counting problem becomes #P-complete and provide an approximation algorithm. Finally, we discuss uniform generation of single-type EDTDs, i.e., the formal abstraction of XSDs. To this end, we provide an algorithm to generate k-occurrence automata (k-OAs) uniformly at random and show how this leads to the uniform generation of single-type EDTDs.
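
To make the counting idea concrete, the following is a minimal sketch of counting trees by size for a toy schema. It is an illustration under stated assumptions, not the algorithm of the paper: the encoding `SCHEMA`, and the helper names `count_trees`, `count_hedges` and `count_up_to`, are hypothetical. Each type's content model is assumed to be given as a deterministic automaton over child types, and the schema is assumed to be unambiguous, so that counting typed derivations coincides with counting trees.

```python
from functools import lru_cache

# Hypothetical toy encoding (not taken from the paper): each type maps to a
# DFA (start_state, accepting_states, transitions) over child types, where
# transitions is a dict {(state, child_type): next_state}.
SCHEMA = {
    "a": (0, frozenset({0}), {(0, "a"): 0}),  # content model a*
}


@lru_cache(maxsize=None)
def count_trees(typ, n):
    """Number of trees of type `typ` with exactly n nodes."""
    if n < 1:
        return 0
    start, _, _ = SCHEMA[typ]
    # One node is the root; the remaining n - 1 nodes form the child hedge.
    return count_hedges(typ, start, n - 1)


@lru_cache(maxsize=None)
def count_hedges(typ, state, m):
    """Number of child sequences with m nodes in total that drive the
    content-model DFA of `typ` from `state` to an accepting state."""
    _, accepting, delta = SCHEMA[typ]
    total = 1 if (m == 0 and state in accepting) else 0
    for (src, child), dst in delta.items():
        if src != state:
            continue
        # The first child uses k nodes; the rest of the hedge uses m - k.
        for k in range(1, m + 1):
            total += count_trees(child, k) * count_hedges(typ, dst, m - k)
    return total


def count_up_to(typ, n):
    """Number of trees of type `typ` with at most n nodes."""
    return sum(count_trees(typ, k) for k in range(1, n + 1))
```

For the single type `a` with content model `a*`, the trees are exactly the ordered rooted trees, so `count_trees("a", n)` yields the Catalan numbers 1, 1, 2, 5, 14 for n = 1 to 5. The same table of counts is what a recursive sampling procedure would consult to choose subtree sizes with the right probabilities.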


Keywords: XML schema languages; Counting; Complexity



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Timos Antonopoulos (1)
  • Floris Geerts (2)
  • Wim Martens (3)
  • Frank Neven (1)

  1. Hasselt University and Transnational University of Limburg, Hasselt, Belgium
  2. University of Antwerp, Antwerp, Belgium
  3. Universität Bayreuth, Bayreuth, Germany
