A quantitative approach to evaluate Website Archivability using the CLEAR+ method

Abstract

Website Archivability (WA) is a notion established to capture the core aspects of a website that are crucial in diagnosing whether it can be archived completely and accurately. In this work, to measure WA, we introduce and elaborate on all aspects of CLEAR+, an extended version of the Credible Live Evaluation Method for Archive Readiness (CLEAR). We use a systematic approach to evaluate WA from multiple perspectives, which we call Website Archivability Facets. We then analyse archiveready.com, a web application we created as the reference implementation of CLEAR+, and discuss the implementation of the evaluation workflow. Finally, we conduct thorough evaluations of all aspects of WA, using real-world web data, to support the validity, reliability and benefits of our method.

Acknowledgments

We would like to thank our colleagues Panagiotis Symeonidis, Georgia Latsiou and Konstantinos Mokos for their assistance with Sect. 5.3, Evaluation by experts. We also thank the anonymous reviewers for their valuable input, which helped us significantly improve this manuscript. In particular, their feedback was critical in improving Sect. 3 on the CLEAR+ method and Sect. 5 on the experimental evaluation.

Author information

Corresponding author

Correspondence to Vangelis Banos.

Cite this article

Banos, V., Manolopoulos, Y. A quantitative approach to evaluate Website Archivability using the CLEAR+ method. Int J Digit Libr 17, 119–141 (2016). https://doi.org/10.1007/s00799-015-0144-4

Keywords

  • Web archiving
  • Website Archivability
  • Web harvesting