Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Abstract

Conventional Web archives are created by periodically crawling a Web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of all pages it has served. In contrast, transactional archives work in conjunction with a Web server to record all content that has been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive written in Java that runs on Apache Web servers and provides a Memento-compatible access interface and WARC file export features. We used Apache’s ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high-fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. With archiving enabled, the time per Web page access rises from 0.076 seconds to 0.086 seconds when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
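
To put the reported timings in proportion, the following is a minimal sketch that computes the absolute and relative slowdown for the two scenarios in the abstract. The four timing values come from the abstract; the function name `overhead` and the scenario labels are illustrative assumptions and are not taken from the paper.

```python
# Sketch only: derives slowdown figures from the timings quoted in the abstract.
# `overhead` and the scenario labels are hypothetical names chosen for illustration.

def overhead(baseline_s: float, archived_s: float) -> tuple[float, float]:
    """Return (absolute slowdown in seconds, slowdown relative to baseline)."""
    absolute = archived_s - baseline_s
    relative = absolute / baseline_s
    return absolute, relative


# Seconds per Web page access, without and with SiteStory archiving (from the abstract).
scenarios = {
    "content server under load": (0.076, 0.086),
    "many embedded and changing resources": (0.15, 0.21),
}

for label, (baseline_s, archived_s) in scenarios.items():
    absolute, relative = overhead(baseline_s, archived_s)
    print(f"{label}: +{absolute:.3f} s per access ({relative:.1%} slower)")
```

Under these figures the added cost is roughly 13% for the loaded server and 40% for the resource with many embedded resources. In absolute terms that is 0.01 and 0.06 seconds per access.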

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brunelle, J.F., Nelson, M.L., Balakireva, L., Sanderson, R., Van de Sompel, H. (2013). Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_20

  • DOI: https://doi.org/10.1007/978-3-642-40501-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40500-6

  • Online ISBN: 978-3-642-40501-3

  • eBook Packages: Computer Science, Computer Science (R0)
