XBeGene: Scalable XML Documents Generator by Example Based on Real Data

Harazaki, Manami; Tekli, Joe; Yokoyama, Shohei; Fukuta, Naoki; Chbeir, Richard; Ishikawa, Hiroshi

doi:10.1007/978-3-642-28807-4_63

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 156)

Abstract

XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited due to the restricted expressiveness they allow users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data which closely reflects the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system (i) allows the user to provide her own sample XML documents as input, (ii) analyzes the structure, occurrence frequencies, and content distributions of each XML element in the user’s input documents, and (iii) produces synthetic XML documents which closely match the user’s input data in both structural and content features. The user also specifies the size of each synthetic document as well as that of the entire document collection. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach’s scalability to large-scale document collections.
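The three steps listed in the abstract (take sample documents, analyze them, generate similar documents) can be illustrated with a minimal Python sketch. This is not the XBeGene algorithm, whose details the abstract does not give. It is a simplified stand-in that records, for every element tag, the child-tag lists and text values observed in the samples, then resamples from those observations. The names `analyze`, `generate`, `max_elements`, `max_depth`, and the sample documents are all assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET


def analyze(samples):
    """Collect simple per-tag statistics from sample XML strings.

    Returns (root_tags, child_lists, texts), where
      root_tags         lists the root tag of each sample,
      child_lists[tag]  holds one list of child tags per observed instance,
      texts[tag]        holds one text value (or None) per observed instance.
    """
    root_tags, child_lists, texts = [], {}, {}
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        root_tags.append(root.tag)
        for elem in root.iter():
            child_lists.setdefault(elem.tag, []).append([c.tag for c in elem])
            text = (elem.text or "").strip()
            texts.setdefault(elem.tag, []).append(text or None)
    return root_tags, child_lists, texts


def generate(stats, rng, max_elements=50, max_depth=10):
    """Build one synthetic document by resampling the observed statistics.

    max_elements caps the document size (a stand-in for the user-specified
    size); max_depth guards against runaway recursion on recursive structures.
    """
    root_tags, child_lists, texts = stats
    budget = [max_elements - 1]  # elements still allowed besides the root

    def build(tag, depth):
        elem = ET.Element(tag)
        # Reuse a text value seen for this tag (None means no text).
        elem.text = rng.choice(texts[tag])
        if depth < max_depth:
            # Reuse the child layout of one observed instance of this tag.
            for child_tag in rng.choice(child_lists[tag]):
                if budget[0] <= 0:
                    break
                budget[0] -= 1
                elem.append(build(child_tag, depth + 1))
        return elem

    return build(rng.choice(root_tags), 0)


# Hypothetical sample input, invented for this example.
SAMPLES = [
    "<book><title>XML Basics</title><author>Lee</author><author>Kim</author></book>",
    "<book><title>Data Trees</title><author>Sato</author></book>",
]

if __name__ == "__main__":
    stats = analyze(SAMPLES)
    doc = generate(stats, random.Random(0), max_elements=20)
    print(ET.tostring(doc, encoding="unicode"))
```

Sampling whole child lists keeps sibling co-occurrence intact at the cost of never producing a layout absent from the samples. A generator aiming for more variety would presumably model occurrence frequencies per child instead, as the abstract's mention of occurrence frequencies suggests.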

Author information

Authors and Affiliations

Department of Computer Science, Faculty of Informatics, Shizuoka University, Hamamatsu-shi, Shizuoka, Japan
Manami Harazaki, Shohei Yokoyama, Naoki Fukuta & Hiroshi Ishikawa
LE2I Laboratory CNRS, University of Bourgogne, 21076, Dijon, France
Joe Tekli & Richard Chbeir

Corresponding author

Correspondence to Manami Harazaki.

Editor information

Editors and Affiliations

Bina Nusantara University, Jakarta, 11480, Indonesia
Ford Lumban Gaol

About this paper

Cite this paper

Harazaki, M., Tekli, J., Yokoyama, S., Fukuta, N., Chbeir, R., Ishikawa, H. (2013). XBeGene: Scalable XML Documents Generator by Example Based on Real Data. In: Gaol, F. (eds) Recent Progress in Data Engineering and Internet Technology. Lecture Notes in Electrical Engineering, vol 156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28807-4_63

DOI: https://doi.org/10.1007/978-3-642-28807-4_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28806-7
Online ISBN: 978-3-642-28807-4
eBook Packages: Engineering, Engineering (R0)
