A Framework for Web Archiving and Guaranteed Retrieval

  • Conference paper
Data Management, Analytics and Innovation

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1016)

Abstract

As of today, ‘web.archive.org’ has more than 338 billion web pages archived. How many of those pages are 100% retrievable? How many pages were left out or ignored just because they had some compatibility issue? How many of them were in vernacular languages and encoded in different formats (before Unicode was standardized)? And that is only the text content type; consider the other MIME types, which are encoded and decoded with different algorithms. The underlying reason lies in the fundamental representation of digital data: a sequence of 0s and 1s makes no sense unless it is decoded properly. The browsers that could have rendered a page properly at the time of archiving might since have become obsolete or been upgraded well beyond recognizing old formats, or the browser platforms themselves might have been upgraded past the point of recognizing old formats. We studied various works on data preservation and web archiving and propose a new framework that stores the exact client browser details (user-agent) in the WARC record and uses them to load the corresponding browser on the client side and render the archived content.
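
The idea of keeping the user-agent next to the archived payload can be illustrated with a minimal Python sketch. It is not the authors' implementation: the header name `WARC-Client-User-Agent`, the simplified record layout, and the `BROWSER_IMAGES` lookup table are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone
import uuid

# Hypothetical extension header; NOT a field defined by the WARC standard.
UA_FIELD = "WARC-Client-User-Agent"


def build_record(target_uri: str, payload: bytes, user_agent: str) -> bytes:
    """Serialize a simplified WARC-style record that also carries the user-agent."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        (UA_FIELD, user_agent),
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def parse_record(record: bytes):
    """Return (header fields, payload bytes) for a record made by build_record."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# Hypothetical mapping from user-agent markers to emulated browser names.
BROWSER_IMAGES = {
    "MSIE 6.0": "ie6",
    "Firefox/3": "firefox3",
}


def pick_browser(user_agent: str, default: str = "modern") -> str:
    """Choose which browser should replay the page, based on the stored user-agent."""
    for marker, image in BROWSER_IMAGES.items():
        if marker in user_agent:
            return image
    return default


if __name__ == "__main__":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    rec = build_record("http://example.com/", b"<p>hi</p>", ua)
    fields, body = parse_record(rec)
    print(pick_browser(fields[UA_FIELD]), body)
```

At replay time, a client would read the stored user-agent from the record and launch the matching browser rather than relying on whatever browser happens to be installed, which is the retrieval guarantee the abstract argues for.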

Author information

Corresponding author

Correspondence to A. Devendran.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Devendran, A., Arunkumar, K. (2020). A Framework for Web Archiving and Guaranteed Retrieval. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_16

  • DOI: https://doi.org/10.1007/978-981-13-9364-8_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-9363-1

  • Online ISBN: 978-981-13-9364-8

  • eBook Packages: Engineering, Engineering (R0)
