Design and Selection Criteria for a National Web Archive

  • Daniel Gomes
  • Sérgio Freitas
  • Mário J. Silva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4172)

Abstract

Web archives and Digital Libraries are conceptually similar, as both store and provide access to digital content. Loading documents into a Digital Library usually requires substantial intervention by human experts. However, large collections of documents gathered from the web must be loaded without human intervention. This paper analyzes strategies for selecting content for a national web archive and proposes a system architecture to support it.

Keywords

Digital Library; Archive Data; National Library; Media Type; Archive Document


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Daniel Gomes (1)
  • Sérgio Freitas (1)
  • Mário J. Silva (1)
  1. Faculty of Sciences, University of Lisbon, Lisboa, Portugal
