Who and What Links to the Internet Archive

  • Yasmin Alnoamany
  • Ahmed Alsum
  • Michele C. Weigle
  • Michael L. Nelson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8092)

Abstract

The Internet Archive’s (IA) Wayback Machine is the largest and oldest public web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research on how it is discovered and used. Based on web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by pages in European languages. Most human users come to web archives because they cannot find the requested pages on the live web. About 65% of the requested archived pages no longer exist on the live web. We find that more than 82% of human sessions connect to the Wayback Machine via referrals from other web sites, while only 15% of robots have referrers. Most of the links (86%) from websites point to individual archived pages at specific points in time, and 83% of those pages no longer exist on the live web.
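
As a rough illustration of the kind of referrer measurement described above, the following minimal sketch counts how many requests in an access log carry a referrer. It is not the authors' pipeline: it assumes the logs follow the Apache/NCSA combined log format, it counts individual requests rather than sessions, and it makes no attempt to separate humans from robots. The names `LOG_PATTERN`, `referrer_share`, and `SAMPLE`, and the sample log lines themselves, are hypothetical.

```python
import re
from collections import Counter

# Apache/NCSA "combined" log format. The abstract does not describe the
# Wayback Machine's actual log layout, so this pattern is an assumption.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def referrer_share(lines):
    """Return the fraction of parseable requests that carry a referrer."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        # An empty field or "-" conventionally means no referrer was sent.
        has_referrer = match.group("referrer") not in ("", "-")
        counts["with" if has_referrer else "without"] += 1
    total = counts["with"] + counts["without"]
    return counts["with"] / total if total else 0.0


# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [01/Feb/2012:00:00:01 +0000] '
    '"GET /web/2005/http://example.com/ HTTP/1.1" 200 512 '
    '"http://example.org/page" "Mozilla/5.0"',
    '192.0.2.2 - - [01/Feb/2012:00:00:02 +0000] '
    '"GET /web/2006/http://example.com/ HTTP/1.1" 200 512 '
    '"-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    print(referrer_share(SAMPLE))
```

A session-level figure such as the ones reported in the abstract would additionally require grouping requests into sessions and classifying each session as human or robot, which this sketch leaves out.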

Keywords

Web Archiving, Web Server Logs, Web Usage Mining, Language Detection

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yasmin Alnoamany (1)
  • Ahmed Alsum (1)
  • Michele C. Weigle (1)
  • Michael L. Nelson (1)

  1. Department of Computer Science, Old Dominion University, Norfolk, USA
