Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

  • Kjetil Nørvåg
  • Albert Overskeid Nybø
Part of the Studies in Computational Intelligence book series (SCI, volume 23)

Abstract

Research on web archives requires large temporal document collections in order to compare and evaluate new strategies and algorithms. Such collections are not easily available, so an alternative is to create synthetic ones. In this paper we describe how to generate synthetic temporal document collections and how this is realized in the TDocGen temporal document generator, and we present a study of the quality of the document collections created by TDocGen.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kjetil Nørvåg (1)
  • Albert Overskeid Nybø (1)
  1. Norwegian University of Science and Technology, Trondheim, Norway
