International Conference on Theory and Practice of Digital Libraries

Research and Advanced Technology for Digital Libraries, pp 3-14

Web Archive Profiling Through CDX Summarization

  • Sawood Alam
  • Michael L. Nelson
  • Herbert Van de Sompel
  • Lyudmila L. Balakireva
  • Harihar Shankar
  • David S. H. Rosenthal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9316)

Abstract

With the proliferation of public web archives, it is becoming more important to profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to have a copy of the requested URI. Using the CDX files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we gained up to 22% routing precision at less than 5% relative cost compared to the complete knowledge profile, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while the complete hostname plus one path segment gave a fivefold increase in routing precision.
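The range of profiling strategies described above, from TLD-only keys to hostname plus path segments, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`profile_key`, `build_profile`, `may_hold`), the `host_labels` and `path_segments` parameters, and the sample URIs are all hypothetical, and the key layout only loosely imitates the sort-friendly (SURT) form used in CDX files. A true registered-domain profile would also need the Public Suffix List, which this sketch replaces with a simple count of host labels.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_labels=None, path_segments=0):
    """Reduce a URI to a SURT-like key such as 'com,example,www)/a'.

    host_labels: keep only this many host labels, counted from the TLD
                 (None keeps the complete hostname). Hypothetical parameter.
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # reversed so the TLD comes first
    if host_labels is not None:
        labels = labels[:host_labels]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in uris}


def may_hold(profile, uri, **policy):
    """Routing check: poll the archive only if the key is in its profile.

    A held URI always maps to a key in the profile (no false negatives);
    coarser policies match more URIs the archive does not hold.
    """
    return profile_key(uri, **policy) in profile


# Made-up holdings, for illustration only.
holdings = ["http://www.example.com/a/b.html", "http://news.example.org/x"]
tld_profile = build_profile(holdings, host_labels=1)
host_path_profile = build_profile(holdings, path_segments=1)
```

With these sample holdings, the TLD-only profile would route a request for any `.com` URI to the archive, while the hostname plus one path segment profile would reject `http://other.com/` and still accept `http://www.example.com/a/c.html`. The finer policy yields more keys, which is the size versus precision trade-off the abstract quantifies.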

Keywords

  • Web archives
  • Profiling
  • CDX Files
  • Memento

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sawood Alam (1)
  • Michael L. Nelson (1)
  • Herbert Van de Sompel (2)
  • Lyudmila L. Balakireva (2)
  • Harihar Shankar (2)
  • David S. H. Rosenthal (3)

  1. Computer Science Department, Old Dominion University, Norfolk, USA
  2. Los Alamos National Laboratory, Los Alamos, USA
  3. Stanford University Libraries, Stanford, USA
