Inferring Deterministic Regular Expression with Unorder

  • Xiaofan Wang
  • Haiming Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12011)


Schema inference has been an essential task in database management, and can be reduced to learning regular expressions from finite sets of positive samples. In this paper, we extend single-occurrence regular expressions (SOREs) to single-occurrence regular expressions with unorder (uSOREs), and give an inference algorithm for uSOREs. First, we present an unorder-countable finite automaton (uCFA). Then, we construct a uCFA that recognizes the given finite sample. Next, the uCFA runs on the given finite sample to count, for every possibly repeated matching, the number of occurrences of the subexpressions that can be connected via unorder. Finally, we transform the uCFA into a uSORE according to these counts. Experimental results demonstrate that, for larger samples, our algorithm can efficiently infer a uSORE with better generalization ability.
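The counting idea can be illustrated with a deliberately simplified sketch. It is not the uCFA construction from the paper. It assumes each sample string is a sequence of single-character symbols, and the helper names `occurrence_bounds` and `order_conflicts` are hypothetical. The first helper records how often each symbol occurs per string; the second reports symbol pairs that appear in both relative orders, which is one rough hint that an unordered group might describe the sample better than a fixed sequence.

```python
from collections import Counter
from itertools import combinations


def occurrence_bounds(sample):
    """Return {symbol: (min_count, max_count)} over all strings in the sample."""
    alphabet = sorted({sym for word in sample for sym in word})
    counts = [Counter(word) for word in sample]
    # Counter returns 0 for symbols that are absent from a given string.
    return {
        sym: (min(c[sym] for c in counts), max(c[sym] for c in counts))
        for sym in alphabet
    }


def order_conflicts(sample):
    """Return sorted symbol pairs (a, b) seen both as a-before-b and b-before-a."""
    before = set()
    for word in sample:
        for i, j in combinations(range(len(word)), 2):
            if word[i] != word[j]:
                before.add((word[i], word[j]))
    return sorted((a, b) for (a, b) in before if a < b and (b, a) in before)


# Hypothetical sample, chosen only for illustration.
sample = ["abc", "bac", "cab", "ab"]
print(occurrence_bounds(sample))
print(order_conflicts(sample))
```

On this toy sample every pair of symbols occurs in both orders and `c` is optional, so a fixed concatenation order cannot describe it. The paper's algorithm derives such information from runs of the uCFA rather than from pairwise comparison as done here.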


Keywords: Schema inference, Regular expressions, Automata, Unorder



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
