International Journal on Digital Libraries, Volume 14, Issue 3–4, pp 149–166

Profiling web archive coverage for top-level domain and content language

  • Ahmed AlSum
  • Michele C. Weigle
  • Michael L. Nelson
  • Herbert Van de Sompel

Abstract

The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to the archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and fulltext queries to archives) and use these profiles as resource descriptors to match URI-lookup requests to the most probable web archives. We define \(Recall_{TM}(n)\) as the percentage of a TimeMap that was returned using \(n\) web archives. We find that sending queries to only the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches, on average, \(Recall_{TM}=0.96\). If we exclude the Internet Archive from the list, we can reach \(Recall_{TM}=0.647\) on average using only the remaining top three web archives.
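
The \(Recall_{TM}(n)\) measure and the 80% figure can be illustrated with a minimal sketch. It assumes that a TimeMap can be summarized as a count of mementos per archive and that a ranked list of archives is already available. The function names, archive names, and counts below are hypothetical and are not taken from the article.

```python
from typing import Dict, List


def recall_tm(memento_counts: Dict[str, int], ranked_archives: List[str], n: int) -> float:
    """Fraction of the full TimeMap returned when only the top-n archives are queried.

    memento_counts: hypothetical number of mementos each archive holds for one URI.
    ranked_archives: archives ordered from most to least probable (assumed given).
    """
    total = sum(memento_counts.values())
    if total == 0:
        return 0.0  # nothing is archived anywhere, so there is nothing to recall
    returned = sum(memento_counts.get(archive, 0) for archive in ranked_archives[:n])
    return returned / total


def query_reduction(n_selected: int, n_archives: int) -> float:
    """Share of queries saved by polling n_selected archives instead of all of them."""
    return 1 - n_selected / n_archives


# Illustrative counts only; these are not results from the article.
counts = {"archive_a": 90, "archive_b": 6, "archive_c": 3, "archive_d": 1}
ranking = ["archive_a", "archive_b", "archive_c", "archive_d"]

top3_recall = recall_tm(counts, ranking, 3)  # 99 of 100 mementos
saved = query_reduction(3, 15)               # 3 of 15 archives queried
```

With fifteen archives, querying three of them skips twelve, which is the 80% reduction stated above.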

Keywords

Web archive · Federated search · Memento Aggregator

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Ahmed AlSum (1)
  • Michele C. Weigle (2)
  • Michael L. Nelson (2)
  • Herbert Van de Sompel (3)
  1. Stanford University Libraries, Stanford, USA
  2. Department of Computer Science, Old Dominion University, Norfolk, USA
  3. Los Alamos National Laboratory, Los Alamos, USA
