Abstract
Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, future Web browsers will no longer be able to properly render Web pages written in HTML4. We therefore propose a tool for migrating pages from HTML4 to HTML5. This is challenging because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter: the blocks generated by a segmenter are good candidates for semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation that helps define and compute relevant metrics to measure the quality of the migration process. We ran experiments on a sample of 40 pages and compared the migrated pages we produce to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page being migrated. When comparing the rendering of the original page with the rendering of its migrated version, we note some differences, mainly because rendering engines do not (yet) properly render the content of semantic elements.
Notes
- 1.
- 2.
Googling the term “translating html 4 tag to html5” will give these references.
- 3.
- 4.
- 5.
- 6.
Usually a tree.
- 7.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sanoja, A., Gançarski, S. (2017). Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_25
DOI: https://doi.org/10.1007/978-3-319-66917-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science, Computer Science (R0)