A First Experience in Archiving the French Web

  • S. Abiteboul
  • G. Cobéna
  • J. Masanes
  • G. Sedrati
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2458)


The web is an increasingly valuable source of information, and organizations are archiving (portions of) it for various purposes, e.g., the Internet Archive. A new mission of the French National Library (BnF) is the “dépôt légal” (legal deposit) of the French web. We describe here some preliminary work on the topic conducted by BnF and INRIA. In particular, we consider the acquisition of the web archive. The issues include defining the perimeter of the French web and choosing which pages to read once and which to read several times (to take changes into account). Keeping several copies of the same page raises versioning issues, which we briefly consider. Finally, we mention some first experiments.


Mirror Site, Internet Archive, Dynamic Page, Page Level, Link Matrix
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • S. Abiteboul (1, 3)
  • G. Cobéna (1)
  • J. Masanes (2)
  • G. Sedrati (3)

  1. INRIA, France
  2. BnF, France
  3. Xyleme, France
