Scraping Data from Web Pages Using SPARQL Queries

Burget, Radek

doi:10.1007/978-3-031-34444-2_21

Radek Burget, ORCID: orcid.org/0000-0001-5233-0456

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13893)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Despite the increasing use of semantic data, plain old HTML web pages often provide a unique interface for accessing data from many domains. To use this data in computer applications or to integrate it with other data sources, it must be extracted from the HTML code. Currently, this is typically done by single-purpose programs called scrapers. For each data source, specific scrapers must be created, which requires a thorough analysis of the source page’s implementation in HTML. This makes writing and maintaining a set of scrapers a complex and time-consuming task. In this paper, we present an alternative approach that allows defining scrapers based on visual properties of the presented content instead of the HTML code structure. First, we render the source page and create an RDF graph that describes the visual properties of every piece of the displayed content. Next, we use SPARQL to query the model and extract the data. As we demonstrate with real-world examples, this approach allows us to easily define more robust scrapers that can be used across multiple web sites and that better cope with changes in the source documents.
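The two steps described in the abstract (describe the rendered content by its visual properties, then query that description) can be illustrated with a small toy sketch. It is not the author's implementation and does not use a real RDF store or SPARQL engine: triples are plain Python tuples, `match` is a hypothetical helper that imitates a SPARQL basic graph pattern, and the property names `ex:text`, `ex:fontSize` and `ex:y` are invented for illustration rather than taken from the FitLayout ontology.

```python
# Toy illustration only. The property names below are hypothetical and are
# NOT the FitLayout vocabulary; the data is made up for the example.

# Step 1 (assumed to be done by a page renderer): every displayed text box
# is described by visual properties instead of its HTML markup.
TRIPLES = [
    ("box1", "ex:text", "Wireless Mouse"),
    ("box1", "ex:fontSize", 24.0),
    ("box1", "ex:y", 120),
    ("box2", "ex:text", "USD 19.99"),
    ("box2", "ex:fontSize", 18.0),
    ("box2", "ex:y", 160),
    ("box3", "ex:text", "Free shipping on orders over USD 50"),
    ("box3", "ex:fontSize", 11.0),
    ("box3", "ex:y", 400),
]


def match(triples, patterns):
    """Evaluate a conjunction of triple patterns.

    Pattern terms that are strings starting with '?' are variables;
    all other terms must equal the corresponding triple component.
    Returns a list of variable bindings (dicts).
    """
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                ok = True
                for term, value in zip(pattern, triple):
                    if isinstance(term, str) and term.startswith("?"):
                        # Bind the variable, or check an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions


def large_texts(triples, min_font_size=16.0):
    """Step 2: select text by visual criteria, in top-to-bottom order.

    Roughly mirrors a SPARQL query of the form (hypothetical vocabulary):
      SELECT ?text WHERE {
        ?box ex:text ?text ; ex:fontSize ?size ; ex:y ?y .
        FILTER(?size >= 16)
      } ORDER BY ?y
    """
    rows = match(triples, [
        ("?box", "ex:text", "?text"),
        ("?box", "ex:fontSize", "?size"),
        ("?box", "ex:y", "?y"),
    ])
    rows = [r for r in rows if r["?size"] >= min_font_size]  # FILTER
    rows.sort(key=lambda r: r["?y"])                         # ORDER BY ?y
    return [r["?text"] for r in rows]


if __name__ == "__main__":
    print(large_texts(TRIPLES))
```

Because the selection depends only on font size and vertical position, the same query would keep working if the page's HTML structure changed but its visual presentation did not, which is the kind of robustness the abstract argues for.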

Notes

1. https://www.w3.org/TR/CSS22/visuren.html
2. http://fitlayout.github.io/ontology/render.owl#
3. http://fitlayout.github.io/ontology/segmentation.owl#
4. https://github.com/FitLayout/sparql-web-scraping
5. https://github.com/FitLayout/FitLayout
6. https://github.com/FitLayout/sparql-web-scraping-results
7. On Intel(R) Core(TM) i5-9500 CPU 3.00 GHz, 16 GB RAM.
8. The rendered page height was 183,294 pixels.

Acknowledgements

This work was supported by project Smart information technology for a resilient society, FIT-S-23-8209, funded by Brno University of Technology.

Author information

Authors and Affiliations

Faculty of Information Technology, Brno University of Technology, Bozetechova 2, 61266, Brno, Czechia
Radek Burget

Corresponding author

Correspondence to Radek Burget.

Editor information

Editors and Affiliations

University of Alicante, Alicante, Spain
Irene Garrigós
University of Extremadura, Cáceres, Spain
Juan Manuel Murillo Rodríguez
Johannes Kepler University Linz, Linz, Austria
Manuel Wimmer

About this paper

Cite this paper

Burget, R. (2023). Scraping Data from Web Pages Using SPARQL Queries. In: Garrigós, I., Murillo Rodríguez, J.M., Wimmer, M. (eds) Web Engineering. ICWE 2023. Lecture Notes in Computer Science, vol 13893. Springer, Cham. https://doi.org/10.1007/978-3-031-34444-2_21

DOI: https://doi.org/10.1007/978-3-031-34444-2_21
Published: 16 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34443-5
Online ISBN: 978-3-031-34444-2
eBook Packages: Computer Science, Computer Science (R0)
