Skip to main content

Discovering Restricted Regular Expressions with Interleaving

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9313))

Abstract

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.

Work supported by the National Natural Science Foundation of China under Grant Nos. 61472405, 61070038.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Abiteboul, S., Bourhis, P., Vianu, V.: Highly expressive query languages for unordered data trees. In: Proceedings of the 15th International Conference on Database Theory, pp. 46–60 (2012)

    Google Scholar 

  2. Bex, G.J., Gelade, W., Martens, W., Neven, F.: Simplifying XML schema: effortless handling of nondeterministic regular expressions. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 731–744 (2009)

    Google Scholar 

  3. Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. arXiv preprint arXiv:1303.4277 (2013)

    Google Scholar 

  4. Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise DTDs from XML data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 115–126. VLDB Endowment, September 2006

    Google Scholar 

  5. Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: Proceedings of the 33rd international conference on Very large data bases, pp. 998–1009. VLDB Endowment, September 2007

    Google Scholar 

  6. Bex, G.J., Wouter, G., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Transactions on the Web (TWEB) 4(4), 14 (2010)

    Google Scholar 

  7. Ignatiev, A., Morgado, A., Marques-Silva, J.: On reducing maximum independent set to minimum satisfiability. In: Sinz, C., Egly, U. (eds.) SAT 2014. LNCS, vol. 8561, pp. 103–120. Springer, Heidelberg (2014)

    Google Scholar 

  8. Clark, J.: Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html

  9. Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)

    Article  MathSciNet  MATH  Google Scholar 

  10. Bailey, R.W.: The number of weak orderings of a finite set. Social Choice and Welfare 15(4), 559–562 (1998)

    Article  MathSciNet  MATH  Google Scholar 

  11. de Koninck, J.M.: Those Fascinating Numbers. American Mathematical Soc. (2009)

    Google Scholar 

  12. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. In: Proceedings of the 16th International Conference on Database Theory, pp. 45–56 (2013)

    Google Scholar 

  13. Ciucanu, R., Staworko, S.: Learning schemas for unordered xml. arXiv preprint arXiv:1307.6348 (2013)

    Google Scholar 

  14. Gionis, A., Kujala, T., Mannila, H.: Fragments of order. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–136 (2003)

    Google Scholar 

  15. Mannila, H., Meek, C.: Global partial orders from sequential data. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 161–168 (2000)

    Google Scholar 

  16. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, vol. 1215, pp. 487–499. VLDB (1994)

    Google Scholar 

  17. Pei, J., Wang, H., Liu, J., Wang, K., et al.: Discovering frequent closed partial orders from strings. IEEE Transactions on Knowledge and Data Engineering 18(11), 1467–1481 (2006)

    Article  Google Scholar 

  18. Miklau, G.: XMLData Repository, November 2002. http://www.cs.washington.edu/research/xmldatasets/

  19. Boppana, R., Halld\(\acute{o}\)rsson, M.M.: Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics 32(2), 180–196 (1992)

    Google Scholar 

  20. Tsukiyama, S., Ide, M., Ariyoshi, H., Shirakawa, I.: A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6(3), 505–517 (1977)

    Article  MathSciNet  MATH  Google Scholar 

  21. Algorithm to divide a set of symbols with constraints into minimun number of subsets. http://stackoverflow.com/q/29117747/4684328

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Feifei Peng .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Peng, F., Chen, H. (2015). Discovering Restricted Regular Expressions with Interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science(), vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-25255-1_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25254-4

  • Online ISBN: 978-3-319-25255-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics