Abstract
Together with a number of national libraries, the Internet Archive committed itself in 2003 to international collaboration to create open source tools and standardized formats for web archiving. This project was motivated by our experience as home to over 100 billion archived web resources dating back to 1996, and as a partner to memory institutions building thematic web archives. Resulting tools include the Heritrix archival web crawler/harvester, the Wayback archive browsing service, and the NutchWAX archive full-text index and query utilities. A standard ingest/archival format for web resources called WARC has also been developed. Software with full source code is free to download and reuse, and organizations worldwide have adopted and contributed to these tools. Working with large collections remains a challenge, and the web itself is constantly growing and changing, so we continue to seek international cooperation to expand and improve this web archive tool set.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mohr, G. (2007). Archival Tools to Match the Web: Open, International, Comprehensive. In: Goh, D.HL., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds) Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. ICADL 2007. Lecture Notes in Computer Science, vol 4822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77094-7_3
DOI: https://doi.org/10.1007/978-3-540-77094-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77093-0
Online ISBN: 978-3-540-77094-7
eBook Packages: Computer Science, Computer Science (R0)