An Empirical Comparison of Web Page Segmentation Algorithms

  • Conference paper
Advances in Information Retrieval (ECIR 2021)

Abstract

Over the past two decades, several algorithms have been developed to segment a web page into semantically coherent units, a task with many applications in web content analysis. However, these algorithms have hardly been compared empirically, and it thus remains unclear which of them (or rather, which of their underlying paradigms) performs best. To help close this gap, we report on the reproduction and comparative evaluation of five segmentation algorithms on a large, standardized benchmark dataset for web page segmentation. Three of the algorithms were developed specifically for web pages and were selected to represent paradigmatically different approaches to the task, whereas the other two originate from the segmentation of photos and print documents, respectively. For a fair comparison, we tuned each algorithm’s parameters, where applicable, to the dataset. Overall, the classic rule-based VIPS algorithm achieved the highest performance, closely followed by the purely visual approach of Cormier et al. For reproducibility, we provide our reimplementations of the algorithms along with detailed instructions.

Notes

  1. Code + documentation: https://github.com/webis-de/ecir21-an-empirical-comparison-of-web-page-segmentation-algorithms Provenance data: https://doi.org/10.5281/zenodo.4146889.

  2. Our port of https://github.com/tpopela/vips_java is available in the code repository of this paper.

  3. https://github.com/tmanabe/HEPS.

  4. https://cocodataset.org/#detection-leaderboard.

  5. We use the model with X-101-64x4d-FPN backbone and c3-c5 DCN as available and suggested at https://github.com/open-mmlab/mmdetection/blob/master/configs/htc.

  6. The authors reported an erratum in their publication to us, so we used the corrected kernel size of \(3\times 3\) instead of \(5\times 5\) for layers conv6-1 and conv7-1.

References

  1. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. 12(4), 461–486 (2009). https://doi.org/10.1007/s10791-008-9066-8

  2. Arias, J., Deschacht, K., Moens, M.F.: Language independent content extraction from web pages. In: Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop, pp. 50–55. University of Twente, Enschede (2009)

  3. Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Extracting content structure for web pages based on visual representation. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds.) APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36901-5_42

  4. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. CoRR abs/1906.07155 (2019). http://arxiv.org/abs/1906.07155

  5. Chen, K., et al.: Hybrid task cascade for instance segmentation. CoRR abs/1901.07518 (2019). http://arxiv.org/abs/1901.07518

  6. Cormier, M., Mann, R., Moffatt, K., Cohen, R.: Towards an improved vision-based web page segmentation algorithm. In: 14th Conference on Computer and Robot Vision, CRV 2017, pp. 345–352 (2017). https://doi.org/10.1109/CRV.2017.38

  7. Cormier, M., Moffatt, K., Cohen, R., Mann, R.: Purely vision-based segmentation of web pages for assistive technology. Comput. Vis. Image Underst. 148, 46–66 (2016). https://doi.org/10.1016/j.cviu.2016.02.007

  8. Goldstein, E.B.: Sensation and Perception. Cengage Learning, 8th edn. (2009). ISBN 9780495601494

  9. Kiesel, J., Kneist, F., Alshomary, M., Stein, B., Hagen, M., Potthast, M.: Reproducible web corpora: interactive archiving with automatic quality assessment. J. Data Inf. Qual. (JDIQ) 10(4), 17:1–17:25 (2018). https://doi.org/10.1145/3239574. https://dl.acm.org/authorize?N676358

  10. Kiesel, J., Kneist, F., Meyer, L., Komlossy, K., Stein, B., Potthast, M.: Web page segmentation revisited: evaluation framework and dataset. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P. (eds.) 29th ACM International Conference on Information and Knowledge Management (CIKM 2020), pp. 3047–3054. ACM (October 2020). https://doi.org/10.1145/3340531.3412782

  11. Kumar, R., et al.: Webzeitgeist: design mining the web. In: Mackay, W.E., Brewster, S.A., Bødker, S. (eds.) 2013 ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2013, pp. 3083–3092. ACM (2013). https://doi.org/10.1145/2470654.2466420

  12. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  13. Ma, L., Goharian, N., Chowdhury, A.: Automatic data extraction from template generated web pages. In: Arabnia, H.R., Mun, Y. (eds.) Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2003, pp. 642–648. CSREA Press (2003)

  14. Manabe, T., Tajima, K.: Extracting logical hierarchical structure of HTML documents based on headings. PVLDB 8(12), 1606–1617 (2015)

  15. Meier, B., Stadelmann, T., Stampfli, J., Arnold, M., Cieliebak, M.: Fully convolutional neural networks for newspaper article segmentation. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 414–419. IEEE (2017)

  16. Nielsen, J., Pernice, K.: Eyetracking Web Usability. Pearson Education, London (2010). ISBN 9780321714077

  17. Zeleny, J., Burget, R., Zendulka, J.: Box clustering segmentation: a new method for vision-based web page preprocessing. Inf. Process. Manag. 53(3), 735–750 (2017). https://doi.org/10.1016/j.ipm.2017.02.002

Acknowledgements

We thank the anonymous reviewers for their helpful comments and the authors of the respective algorithms for providing us with their code, their support for our reimplementations, or both, as stated in the respective sections.

Author information

Corresponding author

Correspondence to Johannes Kiesel.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kiesel, J., Meyer, L., Kneist, F., Stein, B., Potthast, M. (2021). An Empirical Comparison of Web Page Segmentation Algorithms. In: Hiemstra, D., Moens, MF., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12657. Springer, Cham. https://doi.org/10.1007/978-3-030-72240-1_5

  • DOI: https://doi.org/10.1007/978-3-030-72240-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72239-5

  • Online ISBN: 978-3-030-72240-1

  • eBook Packages: Computer Science, Computer Science (R0)
