Learning k-Occurrence Regular Expressions from Positive and Negative Samples

  • Yeting Li
  • Xiaoying Mou
  • Haiming Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11788)


Deterministic regular expressions (DREs) are a core part of XML schema languages such as DTD and XSD and are used in a variety of applications. Currently, the most powerful model for learning DREs is the class of k-occurrence regular expressions (k-OREs for short). However, no existing algorithm can learn k-OREs from positive and negative samples. In this paper, we propose an efficient and effective algorithm to learn k-OREs from positive and negative samples. Our algorithm proceeds in two steps: (1) learning a deterministic k-OA from positive and negative samples using a genetic algorithm; (2) converting the k-OA into optimal deterministic k-OREs.
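To make the terminology concrete, the following minimal Python sketch illustrates two notions the abstract relies on: the occurrence bound k of an expression (in a k-ORE, each alphabet symbol occurs at most k times) and how a candidate expression can be scored against positive and negative samples. It is not the algorithm of the paper. The function names `occurrence_bound` and `sample_accuracy`, the restriction to single-letter alphabet symbols, and the use of plain classification accuracy as a score are all assumptions made for illustration only.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k for which expr is a k-ORE.

    Assumption: every alphabet symbol is a single letter, so k is the
    maximum number of times any one letter appears in the expression.
    """
    counts = Counter(ch for ch in expr if ch.isalpha())
    return max(counts.values(), default=0)


def sample_accuracy(expr: str, positives: list, negatives: list) -> float:
    """Fraction of samples classified correctly by expr.

    A positive sample counts as correct when the whole word is accepted,
    a negative sample when the whole word is rejected. This score is a
    hypothetical stand-in, not the fitness function used in the paper.
    """
    pattern = re.compile(expr)
    correct = sum(1 for w in positives if pattern.fullmatch(w))
    correct += sum(1 for w in negatives if not pattern.fullmatch(w))
    total = len(positives) + len(negatives)
    return correct / total if total else 1.0


if __name__ == "__main__":
    candidate = "a(b|c)*a?"  # the symbol 'a' occurs twice
    print(occurrence_bound(candidate))
    print(sample_accuracy(candidate, ["a", "abca", "acb"], ["", "b", "aa"]))
```

In this example the candidate accepts the negative word "aa", so a search procedure that prefers higher scores would favour a different expression. Any such search would also need to enforce determinism, which this sketch does not check.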


XML schema · Deterministic regular expressions · Language learning · Positive and negative samples



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
