Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10509)

Abstract

Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, Web browsers will no longer be able to properly render Web pages written in HTML4. We therefore propose a tool that migrates pages from HTML4 to HTML5. This is challenging because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter: the blocks generated by a segmenter are good candidates for semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation, which helps to define and compute relevant metrics for measuring the quality of the migration process. We ran experiments on a sample of 40 pages and compared the migrated pages we produced to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page being migrated. When comparing the rendering of the original page with that of its migrated version, we note some differences, mainly because rendering engines do not (yet) properly render the content of semantic elements.
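
To make the block-to-element idea concrete, the following is a minimal sketch and not the tool described in the paper. It assumes a segmenter has already produced a flat list of labeled blocks; the label names, the `SEMANTIC_TAGS` mapping, and the functions `wrap_block` and `migrate` are hypothetical and chosen only for illustration.

```python
from html import escape

# Hypothetical mapping from segmenter block labels to HTML5 semantic
# elements. The label names are assumptions made for illustration only.
SEMANTIC_TAGS = {
    "header": "header",
    "navigation": "nav",
    "main": "main",
    "article": "article",
    "sidebar": "aside",
    "footer": "footer",
}


def wrap_block(label: str, inner_html: str) -> str:
    """Wrap a block's markup in the semantic element for its label.

    Blocks with an unknown label fall back to a generic <section>.
    """
    tag = SEMANTIC_TAGS.get(label, "section")
    return f"<{tag}>{inner_html}</{tag}>"


def migrate(blocks: list[tuple[str, str]], title: str) -> str:
    """Assemble an HTML5 document from (label, inner_html) blocks."""
    body = "\n".join(wrap_block(label, inner) for label, inner in blocks)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body>\n"
        "</html>"
    )
```

Calling `migrate([("navigation", "<a href='/'>Home</a>"), ("footer", "Contact")], "Archived page")` would wrap the first block in `<nav>` and the second in `<footer>`. A real migration also has to handle nested blocks, deprecated HTML4 elements and attributes, and stylesheets that target the original markup, none of which this sketch attempts.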

Notes

  1. web.archive.org/

  2. Googling the term “translating html 4 tag to html5” will give these references.

  3. https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/MOB

  4. https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/BOM

  5. https://github.com/asanoja/web-segmentation-evaluation/tree/master/dataset

  6. Usually a tree.

  7. http://www-poleia.lip6.fr/~sanojaa/BOM/inventory

Author information

Correspondence to Andrés Sanoja.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sanoja, A., Gançarski, S. (2017). Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_25

  • DOI: https://doi.org/10.1007/978-3-319-66917-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66916-8

  • Online ISBN: 978-3-319-66917-5

  • eBook Packages: Computer Science, Computer Science (R0)