Semantically Annotating CEUR-WS Workshop Proceedings with RML
In this paper, we present our solution for the first task of the second Semantic Publishing Challenge. The task requires extracting and semantically annotating information regarding CEUR-WS workshops, their chairs and conference affiliations, as well as their papers and their authors, from a set of HTML-encoded workshop proceedings volumes. Our solution builds on last year's submission, while addressing a number of shortcomings, assessing the generated dataset for its quality, and publishing the queries as SPARQL query templates. This is accomplished using the RDF Mapping Language (RML) to define the mappings, the RMLProcessor to execute them, RDFUnit to both validate the mapping documents and assess the generated dataset's quality, and The DataTank to publish the SPARQL query templates. This results in an overall improved quality of the generated dataset that is reflected in the query results.
A lot of information is available on the Web through websites. However, this information is not always processable by Semantic Web-enabled systems, because most HTML pages lack the required metadata. An example of such a website is the CEUR-WS Workshop Proceedings service (CEUR-WS). CEUR-WS is a publication service for proceedings of scientific workshops. It provides (i) a list of all the volumes, indexed in a single Web page; and (ii) a detailed Web page for each volume. To assess the quality of this scientific output, the Semantic Publishing Challenge (SPC14) was organized in 2014, followed by this year's edition (SPC15).
In this paper, we propose a solution for the challenge's first task, which includes extracting information regarding workshops, their chairs and conference affiliations, as well as their papers and their authors, from a set of HTML-encoded tables of workshop proceedings volumes. To achieve this, we build on last year's submission [1]. The solution uses the RDF Mapping Language (RML) [2, 3], a generic mapping language based on an extension of R2RML, the W3C standard for mapping relational databases to RDF. RML offers a uniform way of defining mapping rules for data in heterogeneous formats.
We follow the same approach as last year. However, we (i) address a number of shortcomings, (ii) assess the generated dataset for its quality, and (iii) publish the queries as SPARQL query templates. This is accomplished using RML (see Sect. 4) to define the mappings, the RMLProcessor to execute them, RDFUnit to both validate the mapping documents and assess the generated dataset's quality (see Sect. 8.2), and The DataTank to publish the SPARQL query templates (see Sect. 8.3).
This paper, which supports our submission to SPC15, is structured as follows: we state the problem in Sect. 2 and give an overview of our approach in Sect. 3. In Sect. 4, we elaborate on the basis of the solution, namely RML. After defining how the data is modeled in Sect. 5, we elaborate on how the mapping is done in Sect. 6. We discuss how the queries of the task are evaluated in Sect. 7. In Sect. 8, we explain the tools used: the RMLProcessor (Sect. 8.1), RDFUnit (Sect. 8.2) and The DataTank (Sect. 8.3). Finally, in Sect. 9, we discuss our solution and its results, after which we form our conclusions.
2 Problem Statement
The challenge consists of three tasks:
- Task 1: Extraction and assessment of workshop proceedings information;
- Task 2: Extracting contextual information from the papers' text in PDF; and
- Task 3: Interlinking.
In this paper, we explain how we tackle the first task of the challenge. The participants are asked to extract information from a set of HTML tables published as Web pages in the CEUR-WS workshop proceedings. The information is obtained from the HTML pages' content, which is semantically annotated and represented using the RDF data model. The extracted information is expected to answer queries about the quality of these workshops, for instance by measuring growth, longevity, and so on. The task is an extension of SPC14's first task. The most challenging quality indicators from last year's challenge are reused. However, a number of them are defined more precisely, and new indicators are added. This results in the following three subtasks:
- Subtask 1.1: Extract information from the HTML input pages;
- Subtask 1.2: Annotate the information with appropriate ontologies and vocabularies; and
- Subtask 1.3: Publish the semantically enriched representation with the RDF data model.
3 Overview of Our Approach
Our publishing workflow consists of the following steps:
- define the mapping documents, using RML;
- assess the mapping documents, using RDFUnit;
- generate the dataset, by executing the mappings with the RMLProcessor;
- assess the quality of the dataset, using RDFUnit; and
- publish the dataset, using The DataTank.
To answer the task's queries, we:
- define the queries as SPARQL templates, using The DataTank;
- instantiate and execute the SPARQL queries; and
- provide the results.
4 The RDF Mapping Language
The RDF Mapping Language (RML) [2, 3] is a generic language defined to express customized mapping rules from data in heterogeneous formats to the RDF data model. RML is defined as a superset of the W3C-standardized mapping language R2RML [5], extending its applicability and broadening its scope. RML keeps the mapping definitions as in R2RML and follows the same syntax, providing a generic way of defining mappings that is easily transferable to cover references to other data structures, combined with case-specific extensions. This makes RML highly extensible towards new source formats.
4.1 Structure of an RML Mapping Document
In RML, the mapping to the RDF data model is based on one or more Triples Maps that define how RDF triples should be generated. A Triples Map consists of three main parts: (i) the Logical Source (rml:LogicalSource), (ii) the Subject Map, and (iii) zero or more Predicate Object Maps.
The Subject Map (rr:SubjectMap) defines the rule that generates the unique identifier (URI) for the resource being mapped; this identifier is used as the subject of all RDF triples generated from this Triples Map. A Predicate Object Map consists of Predicate Maps, which define the rule that generates the triple's predicate, and Object Maps or Referencing Object Maps, which define the rule that generates the triple's object. The Subject Map, the Predicate Map and the Object Map are Term Maps, namely rules that generate an RDF term (an IRI, a blank node or a literal).
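To illustrate, a minimal Triples Map combining these three parts could be sketched as follows; the source URL, the subject template and the dcterms:title rule are illustrative examples, not excerpts from our actual mapping documents:

```turtle
@prefix rr:      <http://www.w3.org/ns/r2rml#> .
@prefix rml:     <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:      <http://semweb.mmlab.be/ns/ql#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<#VolumeMapping>
    # (i) Logical Source: the input document and how to refer into it
    rml:logicalSource [
        rml:source "http://ceur-ws.org/Vol-1165/" ;
        rml:referenceFormulation ql:CSS3
    ] ;

    # (ii) Subject Map: generates the URI used as subject of all triples
    rr:subjectMap [
        rr:template "http://ceur-ws.org/Vol-1165/#proceedings"
    ] ;

    # (iii) Predicate Object Map: one predicate/object generation rule
    rr:predicateObjectMap [
        rr:predicate dcterms:title ;
        rr:objectMap [ rml:reference "h1" ]
    ] .
```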
4.2 Leveraging HTML with RML
A Logical Source (rml:LogicalSource) is used to determine the input source with the data to be mapped. RML deals with different data serializations, which use different ways to refer to their content. Thus, RML considers that any reference to the Logical Source should be defined in a form relevant to the input data, e.g., XPath for XML files or JSONPath for JSON files. The Reference Formulation (rml:referenceFormulation) indicates the formulation (for instance, a standard or a query language) used to refer to the source's data. Any reference to the data of the input source must be a valid expression according to the Reference Formulation stated at the Logical Source. This makes RML highly extensible towards new source formats.
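Switching input formats thus only requires switching the Reference Formulation of the Logical Source; a sketch with illustrative source names:

```turtle
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .

# HTML input, referenced with CSS3 selectors
<#HtmlSource> rml:source "http://ceur-ws.org/Vol-1165/" ;
              rml:referenceFormulation ql:CSS3 .

# XML input, referenced with XPath expressions
<#XmlSource>  rml:source "volumes.xml" ;
              rml:referenceFormulation ql:XPath .

# JSON input, referenced with JSONPath expressions
<#JsonSource> rml:source "volumes.json" ;
              rml:referenceFormulation ql:JSONPath .
```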
5 Data Modeling
To annotate the extracted information, we use the following ontologies and vocabularies:
- the Bibliographic Ontology (prefix bibo);
- DCMI Metadata Terms (prefix dcterms);
- Friend of a Friend (prefix foaf);
- RDF Schema (prefix rdfs);
- the FRBR-aligned Bibliographic Ontology (prefix fabio);
- the Event Ontology (prefix event); and
- Semantic Web for Research Communities (prefix swrc).
The classes used to determine the type of the entities are listed in Table 2. The properties used to annotate the entities and determine the relationships among them are listed in Table 3. The properties listed here are not exhaustive; for a complete overview of the used properties, we refer to the mapping documents. An overview of the entities and the relationships between them is shown in Fig. 1. Overall, the modeling of the data is driven by the queries that need to be answered as part of the challenge.
We extracted information related to workshop (bibo:Workshop) entities from the index page. Furthermore, we extracted information that models the relationship among different workshops of the same series (rdfs:seeAlso), that denotes which proceedings are presented at a workshop (bibo:presentedAt), and that states the conference that the workshop was co-located with (dcterms:isPartOf). To determine the workshops, we iterated over the volumes because, except for the joint volumes, each of them represents a separate workshop. Finally, the workshops related to the current one are added by following the 'see also' links in its description.
Among others, the following relationships are modeled:
- the proceedings of a workshop;
- the event where a workshop took place;
- the editor of a proceedings volume and the author of a paper;
- the person who is the author of a paper;
- the proceedings that a paper belongs to;
- the proceedings that a supplemental document (e.g., an invited paper) belongs to;
- the person who is the editor of a proceedings volume;
- the workshops at which the papers, and hence the proceedings, are presented;
- the workshop series that a workshop is part of;
- the workshop that is related to this workshop; and
- the event that the workshop is a subevent of.
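Following this modeling, the triples generated for a single workshop could look as follows; the conference URI and the related volume are illustrative, and only the properties discussed above are shown:

```turtle
@prefix bibo:    <http://purl.org/ontology/bibo/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

<http://ceur-ws.org/Vol-1165/#workshop>
    a bibo:Workshop ;
    # conference the workshop was co-located with (illustrative URI)
    dcterms:isPartOf <http://example.org/conference/eswc2013> ;
    # earlier workshop of the same series, from the 'see also' links
    rdfs:seeAlso <http://ceur-ws.org/Vol-994/#workshop> .

<http://ceur-ws.org/Vol-1165/#proceedings>
    # the proceedings are presented at the workshop
    bibo:presentedAt <http://ceur-ws.org/Vol-1165/#workshop> .
```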
6 Mapping CEUR-WS from HTML to RDF
The task refers to two types of HTML pages that serve as input: on the one hand, the index page listing all the volumes, namely http://ceur-ws.org; on the other hand, for each volume, an HTML page that contains more detailed information, e.g., http://ceur-ws.org/Vol-1165/.
6.1 Defining the Mappings
6.2 Executing the Mappings
Executing an RML mapping requires a mapping document that summarizes all Triples Maps and points to an input data source. The mapping document is executed by an RML processor and the corresponding RDF output is generated. Each Triples Map is processed, and the defined Subject Map and Predicate Object Maps are applied to the input data. For each reference to the input HTML, the CSS3 extractor returns an extract of the data and the corresponding triples are generated. The resulting RDF can be exported in a user-specified serialization format. This solves Subtask 1.3.
Data cleansing is out of RML's scope. However, the values extracted from the input are not always in the exact form desired for their RDF representation, and the situation is aggravated when mapping, e.g., live HTML documents on the fly, where pre-processing is not possible and CSS3 expressions alone cannot always be as selective as desired when retrieving extracts from HTML pages. To this end, we defined and used rml:process, rml:replace and rml:split to further process the values returned from the input source, as defined within a mapping rule. More precisely, rml:process and rml:replace were used to define regular expressions whenever we needed to be more selective over the returned value and to replace it by a part of the value or by another value. For instance, a reference to h3 span.CEURLOCTIME returns Montpellier, France, May 26, 2013, and since there is no further HTML annotation, we cannot be more selective over the returned value. In such cases, rml:process is used to define a regular expression, e.g., ([a-zA-Z]*), [a-zA-Z]*, [a-zA-Z]* [0-9]*, [0-9]*, and rml:replace is used to define the part of the value that is used for a certain mapping rule, e.g., $1 for the aforementioned case, to map the city Montpellier. Furthermore, rml:split allows splitting the value based on a delimiter and mapping each part separately. The possibility to chain these properties enables even more fine-grained selections. These adjustments contribute to solving Subtask 1.2.
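For the CEURLOCTIME example above, the relevant Object Map could be sketched as follows; the exact syntax of the rml:process and rml:replace extensions is assumed from their description above, and ex:city is a placeholder predicate:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex:  <http://example.org/ns#> .

<#VolumeMapping> rr:predicateObjectMap [
    rr:predicate ex:city ;   # placeholder predicate for the extracted city
    rr:objectMap [
        rml:reference "h3 span.CEURLOCTIME" ;
        # match a value such as "Montpellier, France, May 26, 2013" ...
        rml:process "([a-zA-Z]*), [a-zA-Z]*, [a-zA-Z]* [0-9]*, [0-9]*" ;
        # ... and keep only the first capture group ("Montpellier")
        rml:replace "$1"
    ]
] .
```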
Challenge-Specific Adjustments. The default CSS3 selectors are not expressive enough to cope with a number of non-trivial structures in the challenge-specific HTML input sources. To this end, we added the CSS3 function :until(x) to CSSelly, a Java implementation of the W3C CSS3 specification used by the RMLProcessor. This function matches the first x elements found in the HTML document.
The structure of the index page does not allow using the default CSS3 selectors to extract the required information. Moreover, implementing a custom function is not possible in this case, due to the extensibility limitations of CSSelly. Therefore, we reformatted the index page to make it processable using the available selectors.
7 Query Evaluation
8 Tools
Our publishing workflow is executed using two tools: the RMLProcessor, which executes the mapping definitions and generates the RDF dataset, and RDFUnit, which validates and improves the quality of both the defined schema and the generated dataset. Besides the publishing workflow, we used a third tool, The DataTank, to publish the SPARQL queries.
8.1 RML Processor
Our RMLProcessor, implemented in Java on top of db2triples, was used to perform the mappings. The RMLProcessor follows a mapping-driven processing approach: it reads the mapping definitions as defined with RML and executes the mapping rules to generate the corresponding RDF dataset. The RMLProcessor has a modular architecture in which the extraction and mapping modules are executed independently of each other. When the RML mappings are processed, the mapping module deals with the mappings' execution as defined in the mapping document in RML syntax, while the extraction module deals with the target language expressions, in our case CSS3 expressions. More precisely, the RMLProcessor uses CSSelly, a Java implementation of the W3C CSS3 specification.
8.2 RDFUnit
RDFUnit [6] is an RDF validation framework inspired by test-driven software development. In RDFUnit, every vocabulary, ontology, dataset or application can be accompanied by a set of data quality Test Cases (TCs) that ensure a basic level of quality. Assigning TCs to ontologies results in tests that can be reused by all datasets sharing the same schema. All TCs are executed as SPARQL queries using a pattern-based transformation approach. In our workflow, we use RDFUnit to assure that (i) the mapping documents validate against the RML ontology, (ii) the schema, as a combination of several ontologies and vocabularies, is valid, and (iii) the generated dataset does not contain violations with respect to the schema used.
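Internally, such a Test Case boils down to a SPARQL query that selects the violating resources; a hypothetical example, checking that every workshop carries a label (the constraint itself is illustrative, not one of the actual challenge TCs):

```sparql
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Hypothetical RDFUnit-style Test Case:
# select all workshops that lack an rdfs:label
SELECT ?workshop WHERE {
    ?workshop a bibo:Workshop .
    FILTER NOT EXISTS { ?workshop rdfs:label ?label }
}
```

A non-empty result set signals a violation of the corresponding constraint.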
8.3 The DataTank
The DataTank is a RESTful data management system written in PHP and maintained by OKFN Belgium. It enables publishing several data formats into Web-readable formats. The source data can be stored in text-based files, such as CSV, XML and JSON, or in binary structures, such as SHP files and relational databases. The DataTank reads the data out of these files and/or structures and publishes them on the Web using a URI as an identifier. It can provide the data in any format, depending on the user's needs, independently of the original format. Next to publishing data, The DataTank allows publishing (templated) SPARQL queries. SPARQL templates make it possible for the user to define a variable's value at runtime. As a result, those queries have improved reusability, and their scope fits well with the challenge's needs.
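A templated query could, for instance, parameterize the volume whose papers are requested; the query and the ${volume} placeholder notation shown here are illustrative, not the exact syntax used by The DataTank:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

# List all papers of a given proceedings volume; ${volume} is a
# template variable filled in by the user at runtime (notation illustrative)
SELECT ?paper ?title WHERE {
    ?paper dcterms:isPartOf <http://ceur-ws.org/Vol-${volume}/#proceedings> ;
           dcterms:title ?title .
}
```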
9 Discussion and Conclusion
It is beneficial that CSS3 selectors become part of a formalization that performs mappings of data in HTML. Considering that the RML processor takes care of executing the mappings while the CSS3 extractor parses the document, the data publishers' contribution is limited to providing only the mapping document. As RML enables reusing the same mappings over different files, the effort they put in is even smaller. For the challenge, the same mapping documents and/or definitions were reused for different HTML input sources.
It is reasonable to use CSS3 selectors to extract content from HTML pages because, nowadays, most websites use templates formed with CSS3 selectors. Thus, the content of their Web pages is structured in a similar way, which is the same point of reference as the one used by RML. This allows us to use RML mapping documents as a 'translation layer' over the published content of HTML pages.
Furthermore, as the mappings are partitioned into independent Triples Maps, data publishers can select the Triples Maps they want to execute at any time. For instance, in the case of the challenge, if violations were identified using RDFUnit because of incorrect mappings, we could isolate the Triples Map that generated those triples, correct the relevant mapping definitions and re-execute them, without affecting the remaining mapping definitions or the overall dataset. This becomes even easier considering that RML mappings are defined as triples themselves; thus, the triples' provenance can be tracked and used to identify the mappings and data that caused the erroneous RDF result.
Beyond reusing the same mapping documents, RML allows combining data from different input sources, whether they are in the same format or not. This leads to enhanced results, as the integration of data from different sources occurs during the mapping, and relations between data appearing in different resources can be defined directly instead of interlinking them afterwards. For instance, the proceedings appearing in HTML can be mapped in an integrated fashion with the results of extracting information from the PDFs of the papers published at the workshops, aligning with the results of Task 2. This results in an enriched dataset when the two original datasets are combined.
Compared to last year's submission, we made the following improvements: (i) more information was extracted from the index page, while keeping the volume mapping documents simpler; (ii) the information extraction was focused on answering the challenge's queries; (iii) series and workshops were modeled as separate entities, adding more semantic meaning to the resulting dataset; and (iv) we use single mapping documents for multiple Web pages of the CEUR-WS HTML input sources. These improvements were possible thanks to the updated syntax and the more stable release of the RMLProcessor, leading to a higher number of supported queries.
Acknowledgements. The described research activities were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research Flanders (FWO Flanders), and the European Union.
References
- 1. Dimou, A., Vander Sande, M., Colpaert, P., De Vocht, L., Verborgh, R., Mannens, E., Van de Walle, R.: Extraction and semantic annotation of workshop proceedings in HTML using RML. In: Presutti, V., et al. (eds.) SemWebEval 2014. CCIS, vol. 475, pp. 114–119. Springer, Heidelberg (2014)
- 2. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous data. In: Workshop on Linked Data on the Web (2014)
- 3. Dimou, A., Vander Sande, M., Slepicka, J., Szekely, P., Mannens, E., Knoblock, C., Van de Walle, R.: Mapping hierarchical sources into RDF using the RML mapping language. In: Proceedings of the 8th IEEE International Conference on Semantic Computing (2014)
- 4. Lange, C., Di Iorio, A.: Semantic publishing challenge – assessing the quality of scientific output. In: Presutti, V., et al. (eds.) SemWebEval 2014. CCIS, vol. 475, pp. 61–76. Springer, Heidelberg (2014)
- 5. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C Recommendation, September 2012. http://www.w3.org/TR/r2rml/
- 6. Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen, R., Zaveri, A.: Test-driven evaluation of linked data quality. In: Proceedings of the World Wide Web Conference, pp. 747–758 (2014)