International Journal on Digital Libraries, Volume 17, Issue 3, pp 223–238

Web archive profiling through CDX summarization

  • Sawood Alam
  • Michael L. Nelson
  • Herbert Van de Sompel
  • Lyudmila L. Balakireva
  • Harihar Shankar
  • David S. H. Rosenthal

Abstract

With the proliferation of public web archives, it is becoming more important to profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to have a copy of the requested URI. Using the crawler index files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator’s URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we correctly identified about 78% of the URIs that were present or not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while the complete hostname and one path segment gave a tenfold increase in the routing precision.
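The range of profiling strategies described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`profile_key`, `build_profile`, `should_poll`), the parameters, and the sample URIs are hypothetical, and the key format only approximates a SURT-style reversed hostname. In particular, it keeps a fixed number of host labels rather than consulting the Public Suffix List, so it does not compute a true registered domain.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_labels=None, path_segments=0):
    """Reduce a URI to a simplified SURT-like key, e.g. 'uk,co,bbc,www)/news'.

    host_labels: number of host labels to keep, counted from the TLD
                 (None keeps the whole hostname). Hypothetical parameter.
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    labels = (parts.hostname or "").lower().split(".")
    labels.reverse()  # TLD first, as in SURT ordering
    if host_labels is not None:
        labels = labels[:host_labels]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(archived_uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(u, **policy) for u in archived_uris}


def should_poll(profile, lookup_uri, **policy):
    """Poll the archive only if the lookup URI's key is in its profile."""
    return profile_key(lookup_uri, **policy) in profile


# Made-up holdings for illustration only.
holdings = [
    "http://www.bbc.co.uk/news/world",
    "http://example.com/about/team",
]
tld_only = build_profile(holdings, host_labels=1)
host_plus_path = build_profile(holdings, path_segments=1)
```

Under the TLD-only policy, `should_poll(tld_only, "http://other.co.uk/x", host_labels=1)` would route the request to the archive even though it holds nothing from that host (a false positive), whereas the hostname-plus-one-path-segment policy rejects it. Because every archived URI contributes its own key to the profile, neither policy produces false negatives in this sketch.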

Keywords

Web archives · Profiling · CDX files · Memento · Query routing


Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Computer Science Department, Old Dominion University, Norfolk, USA
  2. Los Alamos National Laboratory, Los Alamos, USA
  3. Stanford University Libraries, Stanford, USA
