Abstract
Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2014, which provide a comprehensive characterization of web archiving initiatives and their evolution. We identified several patterns and trends that highlight challenges and opportunities. We discuss these patterns and trends, which help to define strategies, estimate resources and provide guidelines for research and development of better technology. Our results show that in recent years there has been significant growth in the number of initiatives and of countries hosting them, in the volume of data and in the number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort to preserving digital information, other results presented throughout the paper raise concerns, such as the small amount of archived data in comparison with the amount of data being published online.
Notes
E.g., Internet Archive available at http://www.archive.org.
E.g., Library of Congress Web Archives available at http://www.loc.gov/minerva.
Acknowledgments
This work could not have been done without the support of the Portuguese Web Archive team. We also thank FCT for the financial support of the Research Units of LaSIGE (PEst-OE/EEI/UI0408/2014) and INESC-ID (UID/CEC/50021/2013), and the DataStorm Research Line of Excellency (EXCL/EEI-ESS/0257/2012).
About this article
Cite this article
Costa, M., Gomes, D. & Silva, M.J. The evolution of web archiving. Int J Digit Libr 18, 191–205 (2017). https://doi.org/10.1007/s00799-016-0171-9
DOI: https://doi.org/10.1007/s00799-016-0171-9