Reasoning Web. Semantic Technologies for Advanced Query Answering

Volume 7487 of the series Lecture Notes in Computer Science, pp 184-210

Reasoning and Ontologies in Data Extraction

  • Sergio Flesca (DEIS, University of Calabria)
  • Tim Furche (Department of Computer Science, Oxford University)
  • Linda Oro (ICAR-CNR, University of Calabria)

The web has become a pig sty: everyone dumps information at random places and in random shapes. Try to find the cheapest apartment in Oxford considering rent, travel, tax and heating costs; or a cheap, reasonably well-reviewed 11” laptop with an SSD.

Data extraction flushes structured information out of this sty: it turns mostly unstructured web pages into highly structured knowledge. In this chapter, we give a gentle introduction to data extraction, including pointers to existing systems. We start with an overview and classification of data extraction systems along two primary dimensions: the level of supervision and the considered scale. The rest of the chapter follows the major division of these approaches into site-specific, supervised approaches versus domain-specific, unsupervised ones. We first discuss supervised data extraction, where a human user identifies examples of the relevant data for each site and the system generalizes these examples into extraction programs. We focus particularly on declarative and rule-based paradigms. In the second part, we turn to fully automated (or unsupervised) approaches, where the system identifies the relevant data by itself and extracts it from many websites without human intervention. Ontologies or schemata have proven invaluable for guiding unsupervised data extraction, and we present an overview of the existing approaches and the different ways in which they use ontologies.
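
To make the supervised setting more concrete, the following is a minimal sketch of how a few user-marked examples might be generalized into an extraction program. It is not taken from the chapter or from any particular system: it assumes a deliberately simple "left/right delimiter" wrapper, and the names `induce_wrapper` and `extract` as well as the sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Generalize (page, value) examples into a (left, right) delimiter pair.

    Illustrative assumption: each value is marked by the text immediately
    before and after its first occurrence in the page.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # raises ValueError if the value is absent
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("examples share no common delimiters")
    return left, right


def extract(page, wrapper):
    """Return every substring of page enclosed by the learned delimiters."""
    left, right = wrapper
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Hypothetical pages from one site, each with one user-marked rent value.
examples = [
    ("<li><b>Flat A</b> <i>GBP 900</i></li>", "GBP 900"),
    ("<li><b>Flat B</b> <i>GBP 1200</i></li>", "GBP 1200"),
]
wrapper = induce_wrapper(examples)

new_page = (
    "<ul><li><b>Flat C</b> <i>GBP 750</i></li>"
    "<li><b>Flat D</b> <i>GBP 1100</i></li></ul>"
)
rents = extract(new_page, wrapper)
```

Here the learned wrapper is tied to the markup of a single site, which is why this style of extraction is site-specific: a different site would need its own examples. A wrapper this simple is also brittle, since any change in the surrounding markup breaks the delimiters.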