Even an Ant Can Create an XSD

  • Ondřej Vošta
  • Irena Mlýnková
  • Jaroslav Pokorný
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4947)


XML has undoubtedly become a standard for data representation and manipulation. However, most XML documents are still created without a description of their structure, i.e. an XML schema. Hence, in this paper we focus on the problem of automatically inferring an XML schema from a given sample set of XML documents. In particular, we focus on new features of the XML Schema language, and we propose an algorithm that improves on a combination of verified approaches while remaining general enough to be further enhanced. Using a set of experiments, we illustrate the behavior of the algorithm on both real-world and artificial XML data.


Keywords: Regular Expression, Edit Operation, Candidate Graph, Merging State, Trivial Schema





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ondřej Vošta (1)
  • Irena Mlýnková (1)
  • Jaroslav Pokorný (1)
  1. Faculty of Mathematics and Physics, Department of Software Engineering, Charles University, Prague 1, Czech Republic
