Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool

  • Justin F. Brunelle
  • Michael L. Nelson
  • Lyudmila Balakireva
  • Robert Sanderson
  • Herbert Van de Sompel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8092)

Abstract

Conventional Web archives are created by periodically crawling a Web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates that occur between crawls and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of every page it has served. In contrast, transactional archives work in conjunction with a Web server to record all content that has been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive written in Java that runs on Apache Web servers and provides a Memento-compatible access interface and WARC file export features. We used Apache’s ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high-fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
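To put the reported timings in proportion, the short sketch below computes the absolute and relative slowdown for the two scenarios named in the abstract. It assumes that the first figure in each pair is the per-page access time without transactional archiving and the second is the time with it; the function name `slowdown` and the scenario labels are illustrative and do not come from the paper.

```python
def slowdown(baseline_s: float, archived_s: float) -> tuple:
    """Return (absolute increase in seconds, relative increase in percent)."""
    delta = archived_s - baseline_s
    return delta, 100.0 * delta / baseline_s


# Seconds per Web page access as quoted in the abstract.
# Assumption: each pair is (without archiving, with archiving).
reported = {
    "content server under load": (0.076, 0.086),
    "many embedded, changing resources": (0.15, 0.21),
}

for scenario, (before, after) in reported.items():
    delta, pct = slowdown(before, after)
    print(f"{scenario}: +{delta:.3f} s per access ({pct:.1f}% slower)")
```

Under that reading, the increase is about 0.01 seconds (roughly 13%) under load and 0.06 seconds (40%) for the resource with many embedded, changing resources.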

Keywords

Web Archiving, Digital Preservation

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Justin F. Brunelle (1, 2)
  • Michael L. Nelson (2)
  • Lyudmila Balakireva (3)
  • Robert Sanderson (3)
  • Herbert Van de Sompel (3)
  1. The MITRE Corporation, Hampton, USA
  2. Department of Computer Science, Old Dominion University, Norfolk, USA
  3. Los Alamos National Laboratory, Los Alamos, USA
