Abstract
Over the past two decades, several algorithms have been developed to segment a web page into semantically coherent units, a task with several applications in web content analysis. However, these algorithms have hardly been compared empirically, so it remains unclear which of them (or rather, which of their underlying paradigms) performs best. To contribute to closing this gap, we report on the reproduction and comparative evaluation of five segmentation algorithms on a large, standardized benchmark dataset for web page segmentation. Three of the algorithms were developed specifically for web pages and were selected to represent paradigmatically different approaches to the task, whereas the other two originate from the segmentation of photos and print documents, respectively. For a fair comparison, we tuned each algorithm’s parameters to the dataset where applicable. Altogether, the classic rule-based VIPS algorithm achieved the highest performance, closely followed by the purely visual approach of Cormier et al. For reproducibility, we provide our reimplementations of the algorithms along with detailed instructions.
Notes
- 1.
Code + documentation: https://github.com/webis-de/ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms
Provenance data: https://doi.org/10.5281/zenodo.4146889.
- 2.
Our port of https://github.com/tpopela/vips_java is available in the code repository of this paper.
- 5.
We use the model with X-101-64x4d-FPN backbone and c3-c5 DCN as available and suggested at https://github.com/open-mmlab/mmdetection/blob/master/configs/htc.
- 6.
The authors reported an erratum in their publication to us, so we used the corrected kernel size of \(3\times 3\) instead of \(5\times 5\) for layers conv6-1 and conv7-1.
References
Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. 12(4), 461–486 (2009). https://doi.org/10.1007/s10791-008-9066-8
Arias, J., Deschacht, K., Moens, M.F.: Language independent content extraction from web pages. In: Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop, pp. 50–55. University of Twente, Enschede (2009)
Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Extracting content structure for web pages based on visual representation. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds.) APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36901-5_42
Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. CoRR abs/1906.07155 (2019). http://arxiv.org/abs/1906.07155
Chen, K., et al.: Hybrid task cascade for instance segmentation. CoRR abs/1901.07518 (2019). http://arxiv.org/abs/1901.07518
Cormier, M., Mann, R., Moffatt, K., Cohen, R.: Towards an improved vision-based web page segmentation algorithm. In: 14th Conference on Computer and Robot Vision, CRV 2017, pp. 345–352 (2017). https://doi.org/10.1109/CRV.2017.38
Cormier, M., Moffatt, K., Cohen, R., Mann, R.: Purely vision-based segmentation of web pages for assistive technology. Comput. Vis. Image Underst. 148, 46–66 (2016). https://doi.org/10.1016/j.cviu.2016.02.007
Goldstein, E.B.: Sensation and Perception. Cengage Learning, 8th edn. (2009). ISBN 9780495601494
Kiesel, J., Kneist, F., Alshomary, M., Stein, B., Hagen, M., Potthast, M.: Reproducible web corpora: interactive archiving with automatic quality assessment. J. Data Inf. Qual. (JDIQ) 10(4), 17:1–17:25 (2018). https://doi.org/10.1145/3239574. https://dl.acm.org/authorize?N676358
Kiesel, J., Kneist, F., Meyer, L., Komlossy, K., Stein, B., Potthast, M.: Web page segmentation revisited: evaluation framework and dataset. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (eds.) 29th ACM International Conference on Information and Knowledge Management (CIKM 2020), pp. 3047–3054. ACM (October 2020). https://doi.org/10.1145/3340531.3412782
Kumar, R., et al.: Webzeitgeist: design mining the web. In: Mackay, W.E., Brewster, S.A., Bødker, S. (eds.) 2013 ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2013, pp. 3083–3092. ACM (2013). https://doi.org/10.1145/2470654.2466420
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Ma, L., Goharian, N., Chowdhury, A.: Automatic data extraction from template generated web pages. In: Arabnia, H.R., Mun, Y. (eds.) Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2003, pp. 642–648. CSREA Press (2003)
Manabe, T., Tajima, K.: Extracting logical hierarchical structure of HTML documents based on headings. PVLDB 8(12), 1606–1617 (2015)
Meier, B., Stadelmann, T., Stampfli, J., Arnold, M., Cieliebak, M.: Fully convolutional neural networks for newspaper article segmentation. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 414–419. IEEE (2017)
Nielsen, J., Pernice, K.: Eyetracking Web Usability. Pearson Education, London (2010). ISBN 9780321714077
Zeleny, J., Burget, R., Zendulka, J.: Box clustering segmentation: a new method for vision-based web page preprocessing. Inf. Process. Manag. 53(3), 735–750 (2017). https://doi.org/10.1016/j.ipm.2017.02.002
Acknowledgements
We thank the anonymous reviewers for their helpful comments and the authors of the respective algorithms for providing us with their code, their support for our reimplementations, or both, as stated in the respective sections.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kiesel, J., Meyer, L., Kneist, F., Stein, B., Potthast, M. (2021). An Empirical Comparison of Web Page Segmentation Algorithms. In: Hiemstra, D., Moens, MF., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12657. Springer, Cham. https://doi.org/10.1007/978-3-030-72240-1_5
DOI: https://doi.org/10.1007/978-3-030-72240-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72239-5
Online ISBN: 978-3-030-72240-1
eBook Packages: Computer Science, Computer Science (R0)