Abstract
XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in producing document collections. Nonetheless, the usefulness of existing XML generators remains rather limited because of the restricted expressiveness they allow users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data that closely reflects the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator i) allows the user to provide her own sample XML documents as input, ii) analyzes the structure, occurrence frequencies, and content distributions of each XML element in the input documents, and iii) produces synthetic XML documents that closely match the user’s input data in both structural and content features. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation between the specified user requirements and the characteristics of the generated XML data, while timing results confirm that the approach scales to large document collections.
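The three steps named in the abstract (take sample documents, analyze structure and content per element, generate similar documents of a requested size) can be illustrated with a minimal sketch. This is not XBeGene's algorithm, which the abstract does not detail. It assumes a deliberately simple model in which each element's child sequence and text are drawn at random from the occurrences observed in the samples, and the names `learn_profile`, `generate`, and `max_elements` are hypothetical.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def learn_profile(samples):
    """Record, per element tag, the observed child-tag lists and text values."""
    children = defaultdict(list)  # tag -> one child-tag list per occurrence
    texts = defaultdict(list)     # tag -> observed non-empty text values
    roots = []                    # root tags seen across the samples
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        roots.append(root.tag)
        for elem in root.iter():
            children[elem.tag].append([child.tag for child in elem])
            text = (elem.text or "").strip()
            if text:
                texts[elem.tag].append(text)
    return {"roots": roots, "children": children, "texts": texts}


def generate(profile, max_elements, rng):
    """Build one synthetic document with at most max_elements elements.

    Sampling uniformly from the recorded occurrences reproduces the
    empirical frequencies of child patterns and text values (an
    assumption of this sketch, not a claim about XBeGene).
    """
    budget = [max_elements - 1]  # elements still allowed besides the root

    def build(tag):
        elem = ET.Element(tag)
        if profile["texts"][tag]:
            elem.text = rng.choice(profile["texts"][tag])
        for child_tag in rng.choice(profile["children"][tag]):
            if budget[0] <= 0:
                break  # size limit reached, stop adding children
            budget[0] -= 1
            elem.append(build(child_tag))
        return elem

    return build(rng.choice(profile["roots"]))


# Hypothetical sample input, standing in for the user's own documents.
samples = [
    "<library><book><title>A</title><author>X</author></book></library>",
    "<library><book><title>B</title></book>"
    "<book><title>C</title><author>Y</author></book></library>",
]
profile = learn_profile(samples)
doc = generate(profile, max_elements=10, rng=random.Random(0))
print(ET.tostring(doc, encoding="unicode"))
```

Calling `generate` repeatedly yields a collection of any requested size. The element budget here only caps document size, whereas a real generator would need to hit a target size rather than merely stay under it.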
Copyright information
© 2013 Springer-Verlag GmbH Berlin Heidelberg
About this paper
Cite this paper
Harazaki, M., Tekli, J., Yokoyama, S., Fukuta, N., Chbeir, R., Ishikawa, H. (2013). XBeGene: Scalable XML Documents Generator by Example Based on Real Data. In: Gaol, F. (eds) Recent Progress in Data Engineering and Internet Technology. Lecture Notes in Electrical Engineering, vol 156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28807-4_63
DOI: https://doi.org/10.1007/978-3-642-28807-4_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28806-7
Online ISBN: 978-3-642-28807-4
eBook Packages: Engineering, Engineering (R0)