A Framework for Web Archiving and Guaranteed Retrieval

Devendran, A.; Arunkumar, K.

doi:10.1007/978-981-13-9364-8_16


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1016)


Abstract

As of today, ‘web.archive.org’ has more than 338 billion web pages archived. How many of those pages can be retrieved with 100% fidelity? How many pages were left out or ignored just because they had some compatibility issue? How many of them were in vernacular languages and encoded in different formats (before Unicode was standardized)? And that is only the text content type; consider the other MIME types, which were encoded and decoded with different algorithms. The fundamental reason lies in the basic representation of digital data: a sequence of 0s and 1s makes no sense unless it is decoded properly. The browsers that could render a page properly at the time of archiving might since have become obsolete, or they (or their platforms) might have been upgraded so far that they no longer recognize the old formats. We studied related work on data preservation and web archiving and propose a new framework that stores the exact client browser details (user-agent) in the WARC record and uses them to load the corresponding browser on the client side and render the archived content.
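To make the idea of keeping the user-agent with the archived record concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the user-agent is preserved as the `User-Agent` header of the HTTP request stored in a WARC/1.1 `request` record. The function names (`build_request_record`, `extract_user_agent`, `pick_browser`) and the small `BROWSER_PATTERNS` table are hypothetical and exist only for illustration.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical, deliberately small table: (browser family, version regex).
BROWSER_PATTERNS = [
    ("Internet Explorer", re.compile(r"MSIE (\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]


def build_request_record(target_uri: str, user_agent: str) -> bytes:
    """Serialize a minimal WARC/1.1 'request' record that keeps the User-Agent."""
    parts = urlsplit(target_uri)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The content block is the HTTP request exactly as the client sent it.
    http_request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        f"User-Agent: {user_agent}\r\n"
        "\r\n"
    ).encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.1",
        "WARC-Type: request",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=request",
        f"Content-Length: {len(http_request)}",
    ]
    # WARC header block, blank line, content block, then two CRLFs.
    return (
        "\r\n".join(warc_headers).encode("utf-8")
        + b"\r\n\r\n"
        + http_request
        + b"\r\n\r\n"
    )


def extract_user_agent(record: bytes) -> Optional[str]:
    """Return the User-Agent stored in the record's HTTP request, if any."""
    _warc_headers, _, block = record.partition(b"\r\n\r\n")
    for line in block.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"user-agent":
            return value.strip().decode("utf-8")
    return None


def pick_browser(user_agent: str) -> Tuple[str, str]:
    """Map a user-agent string to a (family, major version) pair for replay."""
    for family, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return family, match.group(1)
    return "unknown", ""


if __name__ == "__main__":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    record = build_request_record("http://example.com/page", ua)
    print(pick_browser(extract_user_agent(record) or ""))
```

At replay time, a system along these lines could use the returned family and version to select a matching legacy browser image before rendering the archived response. How that browser is provisioned on the client side is outside the scope of this sketch.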



Author information

Authors and Affiliations

Dr. M.G.R. Education and Research Institute, Chennai, India
A. Devendran
Technical Architect, Ppltech, Chennai, India
K. Arunkumar


Corresponding author

Correspondence to A. Devendran.

Editor information

Editors and Affiliations

Society for Data Science, Pune, Maharashtra, India
Neha Sharma
A.K. Choudhury School of Information Technology, University of Calcutta, Kolkata, West Bengal, India
Amlan Chakrabarti
Department of Automatics and Applied Software, Aurel Vlaicu University of Arad, Arad, Romania
Valentina Emilia Balas


About this paper

Cite this paper

Devendran, A., Arunkumar, K. (2020). A Framework for Web Archiving and Guaranteed Retrieval. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_16


DOI: https://doi.org/10.1007/978-981-13-9364-8_16
Published: 25 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9363-1
Online ISBN: 978-981-13-9364-8
eBook Packages: Engineering, Engineering (R0)
