XBeGene: Scalable XML Documents Generator by Example Based on Real Data

  • Conference paper
Recent Progress in Data Engineering and Internet Technology

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 156)

Abstract

XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited due to the restricted expressiveness they allow users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data which closely reflect the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system i) allows the user to provide her own sample XML documents as input, ii) analyzes the structure, occurrence frequencies, and content distributions for each XML element in the user input documents, and iii) produces synthetic XML documents which closely match, in both structural and content features, the user’s input data. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach’s scalability to large-scale document collections.
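
The three steps listed in the abstract can be illustrated with a minimal Python sketch. It is not XBeGene's actual implementation, which the abstract does not detail. The function names (`analyze`, `generate`, `generate_collection`), the resampling of observed child sequences and text values, and the interpretation of document size as a maximum element count are all assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def analyze(samples):
    """Step ii (assumed form): collect simple per-tag statistics from samples."""
    model = {
        "roots": [],                    # root tags observed in the samples
        "children": defaultdict(list),  # tag -> one child-tag list per instance
        "texts": defaultdict(list),     # tag -> observed text values
    }
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        model["roots"].append(root.tag)
        for elem in root.iter():
            # Recording every instance keeps structure and occurrence
            # frequencies together: repeated tags appear repeatedly.
            model["children"][elem.tag].append([child.tag for child in elem])
            text = (elem.text or "").strip()
            if text:
                model["texts"][elem.tag].append(text)
    return model


def generate(model, max_elements, rng):
    """Step iii (assumed form): build one document of at most max_elements."""
    root = ET.Element(rng.choice(model["roots"]))
    budget = max_elements - 1  # the root already uses one element
    queue = [root]
    while queue:  # breadth-first, so truncation trims the deepest levels first
        elem = queue.pop(0)
        if model["texts"][elem.tag]:
            # Sampling from observed values follows their empirical distribution.
            elem.text = rng.choice(model["texts"][elem.tag])
        for tag in rng.choice(model["children"][elem.tag]):
            if budget == 0:
                break
            queue.append(ET.SubElement(elem, tag))
            budget -= 1
    return root


def generate_collection(samples, num_docs, max_elements, seed=0):
    """Generate num_docs synthetic documents from user-supplied samples (step i)."""
    rng = random.Random(seed)
    model = analyze(samples)
    return [generate(model, max_elements, rng) for _ in range(num_docs)]


# Hypothetical sample input, invented for this example.
samples = [
    "<book><title>XML Basics</title><author>Ann</author><author>Bob</author></book>",
    "<book><title>Data Engineering</title><author>Cy</author></book>",
]
for doc in generate_collection(samples, num_docs=3, max_elements=10):
    print(ET.tostring(doc, encoding="unicode"))
```

Resampling whole child sequences is the simplest way to keep sibling counts realistic, but it can only reproduce layouts already present in the samples. A generator aiming for more variety would need to model per-child occurrence counts separately.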

Corresponding author

Correspondence to Manami Harazaki.

Copyright information

© 2013 Springer-Verlag GmbH Berlin Heidelberg

About this paper

Cite this paper

Harazaki, M., Tekli, J., Yokoyama, S., Fukuta, N., Chbeir, R., Ishikawa, H. (2013). XBeGene: Scalable XML Documents Generator by Example Based on Real Data. In: Gaol, F. (eds) Recent Progress in Data Engineering and Internet Technology. Lecture Notes in Electrical Engineering, vol 156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28807-4_63

  • DOI: https://doi.org/10.1007/978-3-642-28807-4_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28806-7

  • Online ISBN: 978-3-642-28807-4

  • eBook Packages: Engineering, Engineering (R0)
