Scraping Data from Web Pages Using SPARQL Queries

  • Conference paper
Web Engineering (ICWE 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13893)

Abstract

Despite the increasing use of semantic data, plain old HTML web pages are often the only interface for accessing data from many domains. To use this data in computer applications or to integrate it with other data sources, it must be extracted from the HTML code. Currently, this is typically done by single-purpose programs called scrapers. For each data source, specific scrapers must be created, which requires a thorough analysis of the source page’s implementation in HTML. This makes writing and maintaining a set of scrapers a complex and time-consuming task. In this paper, we present an alternative approach that allows defining scrapers based on visual properties of the presented content instead of the HTML code structure. First, we render the source page and create an RDF graph that describes the visual properties of every piece of the displayed content. Next, we use SPARQL to query the model and extract the data. As we demonstrate with real-world examples, this approach allows us to easily define more robust scrapers that can be used across multiple websites and that better cope with changes in the source documents.
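
The idea of selecting content by visual properties rather than by HTML structure can be illustrated with a minimal sketch. It is not the implementation described in the paper: it uses only the Python standard library, a hand-written pattern match stands in for a SPARQL engine, and the vocabulary (`ex:text`, `ex:fontSize`, `ex:bold`), the box identifiers, and the sample data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical vocabulary for illustration; NOT the actual FitLayout ontology terms.
TEXT, FONT_SIZE, BOLD = "ex:text", "ex:fontSize", "ex:bold"

# Triples a renderer might emit for three visible boxes (made-up sample data).
TRIPLES = [
    ("box1", TEXT, "Acme Kettle"), ("box1", FONT_SIZE, 24.0), ("box1", BOLD, True),
    ("box2", TEXT, "Boils water quickly."), ("box2", FONT_SIZE, 12.0), ("box2", BOLD, False),
    ("box3", TEXT, "19.99 EUR"), ("box3", FONT_SIZE, 16.0), ("box3", BOLD, True),
]


def properties_by_subject(triples):
    """Group (subject, predicate, object) triples into one dict per subject."""
    grouped = defaultdict(dict)
    for subject, predicate, obj in triples:
        grouped[subject][predicate] = obj
    return grouped


def select_text(triples, min_font_size, bold):
    """Return the text of boxes matching the given visual constraints.

    Rough analogue of a SPARQL query such as:
      SELECT ?text WHERE {
        ?b ex:text ?text ; ex:fontSize ?fs ; ex:bold ?bold .
        FILTER(?fs >= min_font_size && ?bold = bold)
      }
    """
    boxes = properties_by_subject(triples)
    return [
        props[TEXT]
        for _, props in sorted(boxes.items())  # sort by box id for stable output
        if props[FONT_SIZE] >= min_font_size and props[BOLD] == bold
    ]


# Large bold text is treated here as a heading, regardless of the HTML tag used.
print(select_text(TRIPLES, min_font_size=20.0, bold=True))
```

Because the selection depends only on how the content is displayed, the same query would in principle keep working if the page switched, say, from a heading element to a styled generic element, which is the robustness argument made in the abstract.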

Notes

  1. https://www.w3.org/TR/CSS22/visuren.html

  2. http://fitlayout.github.io/ontology/render.owl#

  3. http://fitlayout.github.io/ontology/segmentation.owl#

  4. https://github.com/FitLayout/sparql-web-scraping

  5. https://github.com/FitLayout/FitLayout

  6. https://github.com/FitLayout/sparql-web-scraping-results

  7. On Intel(R) Core(TM) i5-9500 CPU 3.00 GHz, 16 GB RAM.

  8. The rendered page height was 183,294 pixels.

Acknowledgements

This work was supported by project Smart information technology for a resilient society, FIT-S-23-8209, funded by Brno University of Technology.

Author information

Correspondence to Radek Burget.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Burget, R. (2023). Scraping Data from Web Pages Using SPARQL Queries. In: Garrigós, I., Murillo Rodríguez, J.M., Wimmer, M. (eds) Web Engineering. ICWE 2023. Lecture Notes in Computer Science, vol 13893. Springer, Cham. https://doi.org/10.1007/978-3-031-34444-2_21

  • DOI: https://doi.org/10.1007/978-3-031-34444-2_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34443-5

  • Online ISBN: 978-3-031-34444-2

  • eBook Packages: Computer Science, Computer Science (R0)
