Abstract
Despite the increasing use of semantic data, plain old HTML web pages often provide a unique interface for accessing data from many domains. To use this data in computer applications or to integrate it with other data sources, it must be extracted from the HTML code. Currently, this is typically done by single-purpose programs called scrapers. For each data source, specific scrapers must be created, which requires a thorough analysis of the source page’s implementation in HTML. This makes writing and maintaining a set of scrapers a complex and time-consuming task. In this paper, we present an alternative approach that allows defining scrapers based on visual properties of the presented content instead of the HTML code structure. First, we render the source page and create an RDF graph that describes the visual properties of every piece of the displayed content. Next, we use SPARQL to query the model and extract the data. As we demonstrate with real-world examples, this approach allows us to easily define more robust scrapers that can be used across multiple web sites and that better cope with changes in the source documents.
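The two steps named in the abstract (describe the rendered content as a graph of visual properties, then query that graph) can be illustrated with a minimal sketch. It is not the paper's implementation: plain Python tuples stand in for an RDF graph, a small filter function stands in for a SPARQL engine, and the property names (`text`, `fontSize`, `y`), the `v:` prefix, and the sample boxes are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# A triple is (subject, predicate, object), as in RDF.
# Predicate names here are illustrative, not the paper's vocabulary.
Triple = Tuple[str, str, object]


def describe_boxes(boxes: Iterable[Dict[str, object]]) -> List[Triple]:
    """Turn rendered content boxes into triples describing their visual properties."""
    triples: List[Triple] = []
    for index, box in enumerate(boxes):
        subject = f"box{index}"
        for predicate, value in box.items():
            triples.append((subject, predicate, value))
    return triples


def select_text(triples: Iterable[Triple], min_font_size: float) -> List[str]:
    """Return the text of boxes with fontSize >= min_font_size, ordered top to bottom.

    Roughly what a SPARQL query of this shape would express
    (hypothetical prefix and property names):

        PREFIX v: <http://example.org/visual#>
        SELECT ?text WHERE {
          ?box v:fontSize ?size ; v:text ?text ; v:y ?y .
          FILTER(?size >= 20)
        } ORDER BY ?y
    """
    # Group the triples by subject so each box can be tested as a whole.
    by_subject: Dict[str, Dict[str, object]] = {}
    for subject, predicate, value in triples:
        by_subject.setdefault(subject, {})[predicate] = value

    hits = [
        props
        for props in by_subject.values()
        if "text" in props and props.get("fontSize", 0) >= min_font_size
    ]
    hits.sort(key=lambda props: props.get("y", 0))
    return [str(props["text"]) for props in hits]


# Hypothetical output of a rendering step: one dict per displayed text box.
rendered = [
    {"text": "Sample product", "fontSize": 24.0, "y": 40},
    {"text": "In stock", "fontSize": 12.0, "y": 90},
    {"text": "19.99 EUR", "fontSize": 20.0, "y": 120},
]

triples = describe_boxes(rendered)
headline_texts = select_text(triples, min_font_size=20.0)
print(headline_texts)
```

The selection depends only on how the content is displayed (font size, vertical position), not on tag names or class attributes, which is why a scraper defined this way would be expected to survive changes in the underlying HTML.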
Notes
- 7. On Intel(R) Core(TM) i5-9500 CPU 3.00 GHz, 16 GB RAM.
- 8. The rendered page height was 183,294 pixels.
Acknowledgements
This work was supported by the project "Smart information technology for a resilient society" (FIT-S-23-8209), funded by Brno University of Technology.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Burget, R. (2023). Scraping Data from Web Pages Using SPARQL Queries. In: Garrigós, I., Murillo Rodríguez, J.M., Wimmer, M. (eds) Web Engineering. ICWE 2023. Lecture Notes in Computer Science, vol 13893. Springer, Cham. https://doi.org/10.1007/978-3-031-34444-2_21
DOI: https://doi.org/10.1007/978-3-031-34444-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34443-5
Online ISBN: 978-3-031-34444-2
eBook Packages: Computer Science, Computer Science (R0)