Extraction and Semantic Annotation of Workshop Proceedings in HTML Using RML
Despite the significant number of existing tools, incorporating data into the Linked Open Data cloud remains complicated, which discourages data owners from publishing their data as Linked Data. Unlocking the semantics of published data, even if they are not provided by the data owners, can help overcome the barriers posed by the low availability of Linked Data and bring us closer to the realisation of the envisaged Semantic Web. RML, a generic mapping language based on an extension of R2RML, the W3C standard for mapping relational databases to RDF, offers a uniform way of defining the mapping rules for data in heterogeneous formats. In this paper, we present how we adjusted our prototype RML Processor, taking advantage of RML’s scalability, to extract and map data of workshop proceedings published in HTML to the RDF data model for the needs of the Semantic Publishing Challenge.
1 Introduction
Data owners lack incentives to publish their data in a format processable by Semantic Web clients, partly because incorporating data into the Linked Open Data cloud remains complicated. Generic solutions fail to support them efficiently, as it is impossible to predict every potential input, while case-specific solutions require individual investment and are ultimately not reused. Furthermore, most of the existing solutions are source-specific. Only a few tools provide mappings from different source formats to RDF, and even those tools employ separate source-centric approaches for each of the supported formats. Thus, whenever a new need emerges to map data from a source in an arbitrary format, the whole implementation is developed from scratch.
The low availability of Linked Data, mainly caused by data owners who, for different reasons, do not publish their data as Linked Data, remains a barrier to the realisation of the Semantic Web. Many data are published as Open Data, but even more are published on Web pages, and only a few of them carry any semantic annotations. Unlocking the semantics of these data is of high importance if we want to be able to query their content. Therefore, we need solutions that allow us to easily obtain the published data in RDF, even if the data providers do not publish them as such.
In an effort to address, among others, the aforementioned issues, we defined RML [3] in our previous work. In the frame of the Semantic Publishing Challenge, selected computer science workshop proceedings published with the CEUR-WS.org open access service were mapped to RDF in order to answer more complicated queries related to the quality of the workshops. To address the challenge of semantically annotating the content of HTML pages, we exploited and demonstrated RML’s extensibility and flexibility. Our RML Processor implementation, which had so far been configured to map data in CSV, XML and JSON formats, was further extended to support mapping of data in HTML to the RDF data model.
2 Related Work
Most of the solutions proposed for data published in HTML Web pages rely on the page’s DOM or on processing the HTML source as an XML document. A variety of solutions, such as Triplr, map data in valid XHTML pages to RDF using Gleaning Resource Descriptions from Dialects of Languages (GRDDL) [2]. GRDDL essentially provides the links, identified by URIs, to the transformations, typically expressed in XSLT, that map the data to RDF. Other approaches chose alternative solutions, such as executing XQuery statements against the DOM of HTML pages.
Approaching (X)HTML pages as XML documents implies that they should be well-formed, as wrong syntax, misused labels, or any other type of inconsistency causes the entire mapping to fail. To deal with invalid HTML documents, Coetzee et al. [1], for instance, balance the tags and validate the model before performing the mappings. However, prior cleansing and re-formatting is not always possible, especially when performing mappings on-the-fly.
3 HTML to RDF Mappings with RML
The RDF Mapping Language (RML) is a generic language defined to express customized mapping rules from data in heterogeneous formats to the RDF data model [3]. RML is defined as a superset of the W3C-standardized mapping language R2RML, extending its applicability and broadening its scope. RML keeps the mapping definitions as in R2RML and follows the same syntax, providing a generic way of defining the mappings that is easily transferable to cover references to other data structures, combined with case-specific extensions. RML considers that sets of sources that together describe a certain domain can be mapped to RDF in a combined and uniform way, while the mapping definitions may be re-used across different sources that describe the same domain.
Structure of an RML Mapping
In RML, the mapping to the RDF data model is based on one or more Triples Maps. A Triples Map consists of three main parts: the Logical Source (rml:LogicalSource), the Subject Map and zero or more Predicate-Object Maps. The Subject Map (rr:SubjectMap) defines the rule that generates unique identifiers (URIs) for the resources which are mapped; the generated identifier is used as the subject of all the RDF triples that are generated from this Triples Map. A Predicate-Object Map consists of Predicate Maps, which define the rule that generates the triple’s predicate, and Object Maps or Referencing Object Maps, which define the rule that generates the triple’s object. The Subject Map, the Predicate Map and the Object Map are Term Maps, namely rules that generate an RDF term (an IRI, a blank node or a literal).
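To make the parts of a Triples Map more concrete, the following is a minimal sketch in Python, not the RML Processor's actual data model. The class names, the `{name}` template syntax, the example.org URIs and the sample record are assumptions made for illustration, and the Logical Source is simplified to a list of already extracted records.

```python
from dataclasses import dataclass, field
import re


@dataclass
class PredicateObjectMap:
    predicate: str  # constant predicate IRI (stands in for a Predicate Map)
    reference: str  # key into the record (stands in for an Object Map reference)


@dataclass
class TriplesMap:
    source: list            # Logical Source, simplified to extracted records
    subject_template: str   # Subject Map, written as a template string
    predicate_object_maps: list = field(default_factory=list)


def expand(template, record):
    """Replace each {name} placeholder with the record's value."""
    return re.sub(r"\{(\w+)\}", lambda m: str(record[m.group(1)]), template)


def generate_triples(triples_map):
    """Apply the Subject Map and every Predicate-Object Map to each record."""
    triples = []
    for record in triples_map.source:
        subject = expand(triples_map.subject_template, record)
        for pom in triples_map.predicate_object_maps:
            value = record.get(pom.reference)
            if value is not None:
                triples.append((subject, pom.predicate, value))
    return triples


# Hypothetical record and URIs, for illustration only.
workshop_map = TriplesMap(
    source=[{"volume": "1", "title": "Example Workshop"}],
    subject_template="http://example.org/volume/{volume}",
    predicate_object_maps=[
        PredicateObjectMap("http://purl.org/dc/terms/title", "title"),
    ],
)
triples = generate_triples(workshop_map)
```

Every triple produced from one record shares the subject generated by the Subject Map, which mirrors the role described above. IRI encoding, blank nodes and Referencing Object Maps are left out of the sketch.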
Leveraging HTML with RML
A Logical Source (rml:LogicalSource) is used to determine the input source with the data to be mapped. RML deals with different data serializations, which use different ways to refer to their content. Thus, RML considers that any reference to the Logical Source should be defined in a form relevant to the input data, e.g. XPath for XML files or JSONPath for JSON files. The Reference Formulation (rml:referenceFormulation) indicates the formulation (for instance, a standard or a query language) used to refer to the data. Any reference to the data of the input source must be a valid expression according to the Reference Formulation defined at the Logical Source. This makes RML highly extensible towards new source formats.
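The role of the Reference Formulation can be sketched as a lookup from a formulation name to an extractor, so that supporting a new format amounts to registering one more function. This is an illustrative Python sketch, not the processor's code: the function names and sample inputs are hypothetical, and the XPath extractor is limited to the XPath subset that Python's ElementTree supports.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_reference(source_text, reference):
    """ql:CSV-style reference: a column name."""
    return [row[reference] for row in csv.DictReader(io.StringIO(source_text))]


def xpath_reference(source_text, reference):
    """ql:XPath-style reference, restricted to ElementTree's XPath subset."""
    root = ET.fromstring(source_text)
    return [element.text for element in root.findall(reference)]


# Registry of supported formulations; a new format adds one entry here.
EXTRACTORS = {"ql:CSV": csv_reference, "ql:XPath": xpath_reference}


def resolve(formulation, source_text, reference):
    """Evaluate a reference with the extractor for the given formulation."""
    try:
        extractor = EXTRACTORS[formulation]
    except KeyError:
        raise ValueError(f"unsupported reference formulation: {formulation}")
    return extractor(source_text, reference)


csv_titles = resolve("ql:CSV", "id,title\n1,Paper A\n2,Paper B\n", "title")
xml_titles = resolve(
    "ql:XPath",
    "<papers><paper><title>Paper A</title></paper>"
    "<paper><title>Paper B</title></paper></papers>",
    "./paper/title",
)
```

Both calls yield the same list of titles although the references are written in different formulations, which is the uniformity the paragraph above describes.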
In the current version of RML, the ql:CSV, ql:XPath and ql:JSONPath Reference Formulations are predefined, while ql:CSS3 was introduced for the challenge’s needs, as we chose the Selectors Level 3 expressions (CSS3) to access the elements within the document. CSS3 selectors are standardized by W3C, and they are easy to use and broadly known, as they are used for selecting HTML elements both for cascading styles and for jQuery. CSS3 selectors can be used to refer to data not only in HTML documents but also in XML documents.
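As an illustration of how a selector-based reference can reach into an HTML document, the sketch below evaluates a very small subset of selectors (tag, .class and tag.class steps joined by the descendant combinator) using only Python's standard library. It is not the extractor used for the challenge; the class and function names, as well as the sample markup, are assumptions, and nested matches are simplified so that only the outermost matching element is captured.

```python
from html.parser import HTMLParser

# Elements without an end tag; they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split 'h3 span.CEURLOCTIME' into [('h3', None), ('span', 'CEURLOCTIME')]."""
    steps = []
    for part in selector.split():
        tag, _, cls = part.partition(".")
        steps.append((tag or None, cls or None))
    return steps


def step_matches(step, element):
    tag, cls = step
    element_tag, element_classes = element
    return (tag is None or tag == element_tag) and (cls is None or cls in element_classes)


def path_matches(steps, stack):
    """The innermost element must match the last step; earlier steps must
    match ancestors in order (descendant combinator)."""
    if not stack or not step_matches(steps[-1], stack[-1]):
        return False
    remaining = list(steps[:-1])
    for element in reversed(stack[:-1]):
        if remaining and step_matches(remaining[-1], element):
            remaining.pop()
    return not remaining


class SelectorExtractor(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.steps = parse_selector(selector)
        self.stack = []            # open elements as (tag, classes)
        self.capture_depth = None  # stack depth of the element being captured
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))
        if self.capture_depth is None and path_matches(self.steps, self.stack):
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break
        else:
            return
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.results.append("".join(self.buffer).strip())
            self.capture_depth = None


def select_text(html, selector):
    """Return the text content of every element matched by the selector."""
    parser = SelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup, loosely modelled on a proceedings page header.
SAMPLE = ('<h1>Proceedings</h1>'
          '<h3><span class="CEURLOCTIME">Montpellier, France, May 26, 2013</span></h3>'
          '<span class="CEURLOCTIME">ignored</span>')
location_and_time = select_text(SAMPLE, "h3 span.CEURLOCTIME")
```

Because the parser tolerates unbalanced tags instead of rejecting the document, a sketch of this kind does not require the page to be well-formed XML. A full Selectors Level 3 implementation covers far more (attribute selectors, pseudo-classes, child and sibling combinators) than this subset.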
Defining RML Documents for CEUR Proceedings
The vocabularies used to describe the domain were selected so as to be aligned with the annotations of the volumes that already included RDFa annotations, while also considering vocabularies relevant to the domain, as listed at http://linkeduniversities.org/lu/index.php/vocabularies/. The RML document for the challenge can be found at http://rml.io/spc/spc.html.
4 Performing Mappings to RDF with RML
Defining and executing a mapping with RML requires the user to provide an input source to be mapped and the mapping document according to which the mapping will be executed to generate the corresponding RDF output dataset. Data cleansing is out of RML’s scope and should be performed in advance. Bearing in mind that such data cleansing is not always possible, e.g. when mapping live HTML documents on-the-fly, we preferred to use regular expressions whenever more selectivity over the returned values is required. For instance, a reference to h3 span.CEURLOCTIME returns a value such as "Montpellier, France, May 26, 2013" and, as there is no further HTML annotation, regular expressions are required to select the parts of the returned value that are to be mapped separately (e.g. the city).
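A regular expression of the kind mentioned above could look as follows. This is a sketch in Python that assumes the value always has the shape "city, country, date"; the pattern and the group names are illustrative and are not taken from the challenge's mapping document.

```python
import re

# Assumed shape: "<city>, <country>, <date>"; the date may itself contain commas.
LOCTIME = re.compile(r"^(?P<city>[^,]+),\s*(?P<country>[^,]+),\s*(?P<date>.+)$")


def split_loctime(value):
    """Return the city, country and date parts, or None if the shape differs."""
    match = LOCTIME.match(value.strip())
    return match.groupdict() if match else None


parts = split_loctime("Montpellier, France, May 26, 2013")
# parts == {"city": "Montpellier", "country": "France", "date": "May 26, 2013"}
```

Returning None for values that do not follow the assumed shape lets the caller skip the triple instead of emitting a wrong one.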
Performing HTML to RDF Mappings with the RML Processor
We used our prototype RML Processor, implemented in Java, and extended it for the challenge’s needs to also support HTML documents. We used CSSelly, a Java implementation of the W3C CSS3 specification. The HTML documents were stored locally before being mapped, as the RML Processor has so far been implemented with the scope of mapping files that are owned by data publishers and exist locally on the system. The definition of RML, though, also allows references to resources that are published on the Web, which can then be retrieved as Web resources instead of local files.
The core functionality of the processor is used as is; we only added the CSS3 selectors to access the HTML input. Each defined Triples Map is processed in consecutive order, and its Subject Map and Predicate-Object Maps are applied. For each reference to the input HTML, the HTML extractor returns an extract of the data. If a regular expression is specified, it is applied over the returned value, and the corresponding triples are generated. The output dataset for the challenge can be found at http://rml.io/spc/spc.html.
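The processing order described above can be sketched as a simple loop. The Python below is an illustration under stated assumptions, not the Java processor's code: mappings are plain dictionaries, the `extract` callback stands in for the HTML extractor, and the example.org URIs and the `regex` key are hypothetical.

```python
import re


def run_mappings(triples_maps, extract):
    """Process Triples Maps in order; `extract(reference)` returns a list of
    strings and stands in for the HTML extractor."""
    triples = []
    for triples_map in triples_maps:
        for pom in triples_map["predicate_object_maps"]:
            for value in extract(pom["reference"]):
                pattern = pom.get("regex")
                if pattern:
                    match = re.search(pattern, value)
                    if not match:
                        continue  # no triple if the expression does not apply
                    value = match.group(1)
                triples.append((triples_map["subject"], pom["predicate"], value))
    return triples


# Stand-in for a parsed page: reference -> extracted values.
fake_page = {"h3 span.CEURLOCTIME": ["Montpellier, France, May 26, 2013"]}

mapping = [{
    "subject": "http://example.org/workshop/1",
    "predicate_object_maps": [
        {"predicate": "http://example.org/vocab#city",
         "reference": "h3 span.CEURLOCTIME",
         "regex": r"^([^,]+),"},
        {"predicate": "http://example.org/vocab#locationAndTime",
         "reference": "h3 span.CEURLOCTIME"},
    ],
}]

triples = run_mappings(mapping, lambda reference: fake_page.get(reference, []))
```

Passing the extractor in as a callback keeps the loop independent of the input format, which corresponds to reusing the core of the processor unchanged while only the access to the input differs.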
5 Discussion and Conclusions
It is beneficial that CSS3 selectors become part of a formalisation that performs mappings of data in HTML. Considering that the RML Processor takes care of executing the mappings while the CSS3 extractor parses the document, the data publishers’ contribution is limited to providing the mapping document. As RML enables the re-use of the same mappings over different files, the effort required from them is even lower. In the case of the challenge, the same mapping documents were used to define the mappings for different HTML input sources.
This is possible because most websites use templates, so the content of their pages is structured in a similar way, and this structure is defined using CSS3 selectors, the same point of reference as the one used by RML. This allows us to use RML mapping documents as a “translation layer” over the published content in order to extract it. Furthermore, as the mappings are partitioned in independent Triples Maps, data owners have the flexibility to execute only a part of the mappings at any time. For instance, if they identify a faulty mapping in their RDF output, they can isolate the Triples Map that generated those triples, correct it and re-execute it without affecting the rest of the dataset.
This becomes even more valuable considering that the mappings in RML are defined as triples themselves. The triples’ provenance can be tracked and used to identify the mappings and data that cause the “faulty” RDF result [4]. Last, the mapping rules are interoperable: any tool that supports RML can process them, either to execute them, as our RML Processor does, or to refine them, e.g. by importing them into an application such as Karma or OpenRefine.
Beyond re-using the same mapping documents, data publishers can combine data from different input sources, whether or not they are in the same format. This leads to enhanced results, as the integration of data from different sources occurs during the mapping, and relations between data appearing in different resources can be defined directly instead of interlinking them afterwards. For instance, the proceedings appearing in HTML can be mapped in an integrated fashion with the XML versions of the papers published at the workshops, enriching the resulting dataset with properties that are defined by considering the combination of the two documents.
To sum up, this solution demonstrates the scalability of RML, as it was successfully extended to define mappings from data in HTML to the RDF data model.
The described research activities were funded by Ghent University, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research Flanders (FWO Flanders), and the European Union.
1. Coetzee, P., Heath, T., Motta, E.: Sparqplug: generating linked data from legacy HTML, SPARQL and the DOM (2008)
2. Connolly, D.: Gleaning resource descriptions from dialects of languages (GRDDL). W3C recommendation, September 2007
3. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous data. In: Workshop on Linked Data on the Web (2013)
4. Dimou, A., Vander Sande, M., De Nies, T., Verborgh, R., Mannens, E., Van de Walle, R.: RDF mapping rules refinements according to data consumers feedback. In: 23rd International World Wide Web Conference, Poster Track Proceedings (2014)
5. Droop, M., et al.: Translating XPath queries into SPARQL queries. In: Meersman, R., Tari, Z. (eds.) OTM 2007 Workshops, Part I. LNCS, vol. 4805, pp. 9–10. Springer, Heidelberg (2007)