XBeGene: Scalable XML Documents Generator by Example Based on Real Data

  • Manami Harazaki
  • Joe Tekli
  • Shohei Yokoyama
  • Naoki Fukuta
  • Richard Chbeir
  • Hiroshi Ishikawa
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 156)


XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited due to the restricted levels of expressiveness allowed to users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data which closely reflect the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system i) allows the user to provide her own sample XML documents as input, ii) analyzes the structure, occurrence frequencies, and content distributions for each XML element in the user input documents, and iii) produces synthetic XML documents which closely match, in both structural and content features, the user’s input data. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach’s scalability to large-scale document collections.
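
The three steps named in the abstract (accept sample documents, profile them, generate similar documents) can be illustrated with a minimal sketch. This is not XBeGene's actual algorithm, which the abstract does not detail; it assumes a much simpler model in which each element's child-tag sequence and leaf text are resampled from what was observed in the samples. The function names `analyze` and `generate`, the `max_depth` guard, and the sample documents are all hypothetical.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def analyze(samples):
    """Profile sample XML strings (assumed model, not the paper's).

    Records the observed root tags, every child-tag sequence seen under
    each tag, and every non-empty text value seen for each tag.
    """
    profile = {
        "roots": [],
        "children": defaultdict(list),  # tag -> list of child-tag sequences
        "texts": defaultdict(list),     # tag -> list of observed text values
    }
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        profile["roots"].append(root.tag)
        for elem in root.iter():
            profile["children"][elem.tag].append([c.tag for c in elem])
            text = (elem.text or "").strip()
            if text:
                profile["texts"][elem.tag].append(text)
    return profile


def generate(profile, rng, max_depth=10):
    """Build one synthetic element tree by resampling the profile.

    Sampling uniformly from the recorded lists reproduces the empirical
    frequencies of structures and values. max_depth (hypothetical) stops
    runaway recursion on recursive schemas.
    """
    def build(tag, depth):
        elem = ET.Element(tag)
        child_tags = []
        if depth < max_depth:
            child_tags = rng.choice(profile["children"][tag])
        if child_tags:
            for child_tag in child_tags:
                elem.append(build(child_tag, depth + 1))
        elif profile["texts"][tag]:
            # Leaf element: reuse one of the observed text values.
            elem.text = rng.choice(profile["texts"][tag])
        return elem

    return build(rng.choice(profile["roots"]), 0)


# Hypothetical sample input, standing in for the user's own documents.
samples = [
    "<lib><book><title>A</title><year>2001</year></book></lib>",
    "<lib><book><title>B</title></book>"
    "<book><title>C</title><year>2005</year></book></lib>",
]
profile = analyze(samples)
synthetic = generate(profile, random.Random(0))
print(ET.tostring(synthetic, encoding="unicode"))
```

A user-specified collection size would amount to calling `generate` repeatedly. Controlling the size of each individual document, which the abstract also mentions, is not covered by this sketch.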


Occurrence Probability; Document Collection; Very Large Data Base; SIGMOD Record; High Correlation Level
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2013

Authors and Affiliations

  • Manami Harazaki (1), corresponding author
  • Joe Tekli (2)
  • Shohei Yokoyama (1)
  • Naoki Fukuta (1)
  • Richard Chbeir (2)
  • Hiroshi Ishikawa (1)
  1. Department of Computer Science, Faculty of Informatics, Shizuoka University, Shizuoka, Japan
  2. LE2I Laboratory CNRS, University of Bourgogne, Dijon, France
