Discovering Restricted Regular Expressions with Interleaving

Peng, Feifei; Chen, Haiming

doi:10.1007/978-3-319-25255-1_9

Discovering Restricted Regular Expressions with Interleaving

Feifei Peng^18,19 &
Haiming Chen¹⁸

Conference paper
First Online: 13 November 2015

2837 Accesses
12 Citations

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9313))

Abstract

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.

Work supported by the National Natural Science Foundation of China under Grant Nos. 61472405, 61070038.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Abiteboul, S., Bourhis, P., Vianu, V.: Highly expressive query languages for unordered data trees. In: Proceedings of the 15th International Conference on Database Theory, pp. 46–60 (2012)
Google Scholar
Bex, G.J., Gelade, W., Martens, W., Neven, F.: Simplifying XML schema: effortless handling of nondeterministic regular expressions. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 731–744 (2009)
Google Scholar
Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. arXiv preprint arXiv:1303.4277 (2013)
Google Scholar
Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise DTDs from XML data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 115–126. VLDB Endowment, September 2006
Google Scholar
Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: Proceedings of the 33rd international conference on Very large data bases, pp. 998–1009. VLDB Endowment, September 2007
Google Scholar
Bex, G.J., Wouter, G., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Transactions on the Web (TWEB) 4(4), 14 (2010)
Google Scholar
Ignatiev, A., Morgado, A., Marques-Silva, J.: On reducing maximum independent set to minimum satisfiability. In: Sinz, C., Egly, U. (eds.) SAT 2014. LNCS, vol. 8561, pp. 103–120. Springer, Heidelberg (2014)
Google Scholar
Clark, J.: Trang: Multi-format schema converter based on RELAX NG. http://www.thaiopensource.com/relaxng/trang.html
Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
Article MathSciNet MATH Google Scholar
Bailey, R.W.: The number of weak orderings of a finite set. Social Choice and Welfare 15(4), 559–562 (1998)
Article MathSciNet MATH Google Scholar
de Koninck, J.M.: Those Fascinating Numbers. American Mathematical Soc. (2009)
Google Scholar
Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. In: Proceedings of the 16th International Conference on Database Theory, pp. 45–56 (2013)
Google Scholar
Ciucanu, R., Staworko, S.: Learning schemas for unordered xml. arXiv preprint arXiv:1307.6348 (2013)
Google Scholar
Gionis, A., Kujala, T., Mannila, H.: Fragments of order. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–136 (2003)
Google Scholar
Mannila, H., Meek, C.: Global partial orders from sequential data. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 161–168 (2000)
Google Scholar
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, vol. 1215, pp. 487–499. VLDB (1994)
Google Scholar
Pei, J., Wang, H., Liu, J., Wang, K., et al.: Discovering frequent closed partial orders from strings. IEEE Transactions on Knowledge and Data Engineering 18(11), 1467–1481 (2006)
Article Google Scholar
Miklau, G.: XMLData Repository, November 2002. http://www.cs.washington.edu/research/xmldatasets/
Boppana, R., Halld\(\acute{o}\)rsson, M.M.: Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics 32(2), 180–196 (1992)
Google Scholar
Tsukiyama, S., Ide, M., Ariyoshi, H., Shirakawa, I.: A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6(3), 505–517 (1977)
Article MathSciNet MATH Google Scholar
Algorithm to divide a set of symbols with constraints into minimun number of subsets. http://stackoverflow.com/q/29117747/4684328

Download references

Author information

Authors and Affiliations

State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Feifei Peng & Haiming Chen
University of Chinese Academy of Sciences, Beijing, China
Feifei Peng

Authors

Feifei Peng
View author publications
You can also search for this author in PubMed Google Scholar
Haiming Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Feifei Peng .

Editor information

Editors and Affiliations

University of Hong Kong, Hong Kong, China
Reynold Cheng
Computer Science, Peking University, Beijing, China
Bin Cui
Advanced Digital Sciences Center (ADSC), Singapore, Singapore
Zhenjie Zhang
University of Technology, Guangzhou, China
Ruichu Cai
Guangxi University, Guangxi, China
Jia Xu

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Peng, F., Chen, H. (2015). Discovering Restricted Regular Expressions with Interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science(), vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_9

Download citation

DOI: https://doi.org/10.1007/978-3-319-25255-1_9
Published: 13 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25254-4
Online ISBN: 978-3-319-25255-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics