Theory of Computing Systems, Volume 57, Issue 4, pp 806–842

Optimal Probabilistic Generation of XML Documents

  • Serge Abiteboul
  • Yael Amsterdamer
  • Daniel Deutch
  • Tova Milo
  • P. Senellart


Given a corpus of XML documents and its schema, we study the problem of finding an optimal (generative) probabilistic model, where optimality means maximizing the likelihood that the model generates this particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We then study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints, and consider two kinds of generators for this case. First, we consider a continuation-test generator that performs schema satisfiability tests while generating documents; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this happens, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and study different approaches for generating these values.
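To make the two central ideas of the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm from the paper: the toy "schema" (a root with a random number of item children, each carrying an id), the function names, and the parameters are all hypothetical. The first function uses the standard relative-frequency estimate, which maximizes likelihood for a single independent choice point; the last one shows the restart idea as plain rejection sampling against a key constraint.

```python
import random
from collections import Counter


def estimate_choice_probs(observed_choices):
    """Relative-frequency estimate for one choice point.

    For independent draws from a finite set of alternatives, this is the
    maximum-likelihood estimate of the choice probabilities.
    """
    counts = Counter(observed_choices)
    total = sum(counts.values())
    return {choice: n / total for choice, n in counts.items()}


def generate_once(child_count_probs, id_domain, rng):
    """Generate one toy document: a list of ids, one per <item> child.

    Hypothetical model: first draw the number of children, then draw each
    id independently, ignoring the key constraint.
    """
    counts = list(child_count_probs)
    weights = [child_count_probs[c] for c in counts]
    n = rng.choices(counts, weights=weights, k=1)[0]
    return [rng.choice(id_domain) for _ in range(n)]


def satisfies_key(ids):
    """Key constraint of the toy schema: all ids must be distinct."""
    return len(ids) == len(set(ids))


def restart_generate(child_count_probs, id_domain, rng, max_restarts=10_000):
    """Restart generator: discard invalid documents and try again."""
    for attempt in range(1, max_restarts + 1):
        doc = generate_once(child_count_probs, id_domain, rng)
        if satisfies_key(doc):
            return doc, attempt
    raise RuntimeError("no valid document within max_restarts")


if __name__ == "__main__":
    # Child counts observed in a hypothetical corpus of four documents.
    probs = estimate_choice_probs([1, 2, 2, 3])
    rng = random.Random(0)
    doc, attempts = restart_generate(probs, ["a", "b", "c", "d", "e"], rng)
    print(probs, doc, attempts)
```

A continuation-test generator would instead check, before each random choice, whether the partial document can still be completed into a valid one; the sketch leaves that out because such a check depends on the schema and constraint language.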


XML, Schema, Constraints, Generator, Probabilistic model



We would like to thank Yann Ollivier for insightful comments, and Siqi Liu for feedback on the proof of Theorem 5. This work has been supported in part by the Advanced European Research Council grants Webdam (agreement 226513) and MoDaS (agreement 291071), by the Israel Ministry of Science, and by the US–Israel Binational Science Foundation.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Serge Abiteboul (1)
  • Yael Amsterdamer (2)
  • Daniel Deutch (2)
  • Tova Milo (2)
  • P. Senellart (3)

  1. INRIA Saclay and ENS Cachan, Paris, France
  2. Tel Aviv University, Tel Aviv, Israel
  3. Ben Gurion University, Beer Sheva, Israel
