Theory of Computing Systems, Volume 57, Issue 4, pp 1114–1158

Fast Learning of Restricted Regular Expressions and DTDs

  • Dominik D. Freydenberger
  • Timo Kötzing


We study the problem of generalizing from a finite sample to a language taken from a predefined language class. The two language classes we consider are subsets of the regular languages and have significance in the specification of XML documents (the classes corresponding to so-called chain regular expressions, Chares, and to single-occurrence regular expressions, Sores). The previous literature gives a number of algorithms for generalizing to Sores providing a trade-off between quality of the solution and speed. Furthermore, a fast but non-optimal algorithm for generalizing to Chares is known. For each of the two language classes we give an efficient algorithm returning a minimal generalization from the given finite sample to an element of the fixed language class; such generalizations are called descriptive. In this sense of descriptivity, both our algorithms are optimal.
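The notions in the abstract can be made concrete with a small sketch. It is not the algorithm from the article: it only illustrates what it means for an expression to be single-occurrence, to generalize a sample, and to be a tighter generalization than another expression. The helper names (`is_single_occurrence`, `covers`, `language_up_to`) are hypothetical, alphabet symbols are assumed to be single letters, expressions are written in Python `re` syntax (`|` for union), and language inclusion is only approximated by comparing all words up to a fixed length.

```python
import re
from itertools import product


def is_single_occurrence(expr):
    """True if every alphabet symbol (assumed: a single letter) occurs at most once."""
    symbols = [c for c in expr if c.isalpha()]
    return len(symbols) == len(set(symbols))


def covers(expr, sample):
    """True if every word of the sample belongs to the language of expr."""
    return all(re.fullmatch(expr, word) is not None for word in sample)


def language_up_to(expr, alphabet, max_len):
    """All words over alphabet of length <= max_len that expr accepts.

    This is a finite approximation of the language, used only for illustration.
    """
    words = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if re.fullmatch(expr, word) is not None:
                words.add(word)
    return words


sample = {"b", "ab", "abb"}
tight = "a?b+"     # single-occurrence, covers the sample
loose = "(a|b)*"   # single-occurrence, covers the sample, but accepts far more

for expr in (tight, loose):
    assert is_single_occurrence(expr) and covers(expr, sample)

# Up to length 3, the tight expression accepts a proper subset of the loose one.
assert language_up_to(tight, "ab", 3) < language_up_to(loose, "ab", 3)
```

Both expressions generalize the sample, but `(a|b)*` is not minimal because `a?b+` also covers the sample and describes a smaller language. A descriptive generalization, in the sense of the abstract, is one for which no such smaller covering language exists within the class; the bounded comparison here can suggest this but cannot prove it.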


Subregular language learning · Single-occurrence regular expression · Chain regular expression · Descriptive generalization



This work was done while Dominik D. Freydenberger was visiting the Max-Planck-Institute for Informatics in Saarbrücken. The authors wish to thank the anonymous referees, both of the conference version of this article and of the present version, for their helpful remarks. We also wish to thank Ping Lu for finding a mistake in an earlier version of the algorithm for finding descriptive Sores.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Johann-Wolfgang-Goethe-Universität, Frankfurt am Main, Germany
  2. Friedrich-Schiller-Universität, Jena, Germany
