Fast Learning of Restricted Regular Expressions and DTDs
We study the problem of generalizing from a finite sample to a language taken from a predefined language class. The two language classes we consider are subsets of the regular languages and have significance in the specification of XML documents (the classes corresponding to so-called chain regular expressions, Chares, and to single-occurrence regular expressions, Sores). The previous literature gives a number of algorithms for generalizing to Sores providing a trade-off between quality of the solution and speed. Furthermore, a fast but non-optimal algorithm for generalizing to Chares is known. For each of the two language classes we give an efficient algorithm returning a minimal generalization from the given finite sample to an element of the fixed language class; such generalizations are called descriptive. In this sense of descriptivity, both our algorithms are optimal.
Keywords: Subregular language learning; Single-occurrence regular expression; Chain regular expression; Descriptive generalization
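To make the single-occurrence restriction concrete, the following minimal sketch checks whether a given expression uses each alphabet symbol at most once, which is the defining property of a Sore. It is not the learning algorithm of this article. The function names and the convention that every alphanumeric character is one alphabet symbol, with all other characters acting as operators, are assumptions made for illustration only.

```python
from collections import Counter


def symbol_counts(expr: str) -> Counter:
    """Count how often each alphabet symbol occurs in a regular expression.

    Assumption (illustrative only): every alphanumeric character is one
    alphabet symbol; every other character such as ( ) | ? * + is an operator.
    """
    return Counter(ch for ch in expr if ch.isalnum())


def is_single_occurrence(expr: str) -> bool:
    """Return True if no alphabet symbol occurs more than once in expr."""
    return all(count == 1 for count in symbol_counts(expr).values())


if __name__ == "__main__":
    # Each of a, b, c, d occurs exactly once, so this is single-occurrence.
    print(is_single_occurrence("a?b(c|d)*"))
    # The symbol a occurs twice, so this is not single-occurrence.
    print(is_single_occurrence("a(b|a)*"))
```

A check like this only tests the syntactic restriction. Deciding whether an expression is a minimal (descriptive) generalization of a sample additionally requires comparing the languages of candidate expressions, which this sketch does not attempt.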
This work was done while Dominik D. Freydenberger was visiting the Max-Planck-Institute for Informatics in Saarbrücken. The authors wish to thank the anonymous referees of both the conference version of this article and the present version for their helpful remarks. We also wish to thank Ping Lu for finding a mistake in an earlier version of the algorithm for finding descriptive Sores.