Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation

  • Andrés Sanoja
  • Stéphane Gançarski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10509)

Abstract

Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, Web browsers will no longer be able to properly render Web pages written in HTML4. We therefore propose a tool that migrates pages from HTML4 to HTML5. This is challenging because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter. Indeed, blocks generated by a segmenter are good candidates for being semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation that helps define and compute relevant metrics to measure the quality of the migration process. We ran experiments on a sample of 40 pages. The migrated pages we produce are compared to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page we migrate. When comparing the rendering of the original page with the rendering of its migrated version, we note some differences, mainly because rendering engines do not (yet) properly render the content of semantic elements.

Keywords

Migration, Web, Segmentation, Blocks, HTML5, Web archive, Format obsolescence

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Escuela de Computación, Universidad Central de Venezuela, Caracas, Venezuela
  2. Laboratoire d’Informatique de Paris 6, Université Pierre et Marie Curie, Paris, France