
1 Introduction

A large amount of information is available on the Web through websites. However, this information is not always processable by Semantic Web enabled systems, because most html pages lack the required metadata. An example of such a website is ceur-ws Workshop Proceedings (ceur-ws)Footnote 1. ceur-ws is a publication service for proceedings of scientific workshops. It provides (i) a single Web page listing all indexed volumes; and (ii) a detailed Web page for each volume. To assess the quality of this scientific output, the Semantic Publishing Challenge (spc14) was organized in 2014Footnote 2, followed by this year’s editionFootnote 3 (spc15).

In this paper, we propose a solution for the challenge’s first taskFootnote 4, which involves extracting information regarding workshops, their chairs and conference affiliations, as well as their papers and their authors, from a set of html-encoded tables of workshop proceedings volumes. To achieve this, we build on last year’s submission [1]. The solution uses the rdf Mapping Language (rml)Footnote 5 [2, 3], a generic mapping language based on an extension of r2rml, the w3c standard for mapping relational databases to rdf. rml offers a uniform way of defining mapping rules for data in heterogeneous formats.

We follow the same approach as last year. However, we (i) address a number of shortcomings, (ii) assess the generated dataset for its quality and (iii) publish the queries as sparql query templates. This is accomplished using rml  (see Sect. 4) to define the mappings, the rmlprocessor to execute them, the rdfunit to both validate the mapping documents and assess the generated dataset’s quality (see Sect. 8.2), and the datatank to publish the sparql query templates (see Sect. 8.3).

This paper, which supports our submission to spc15, is structured as follows: we state the problem in Sect. 2, and give an overview of our approach in Sect. 3. In Sect. 4 we elaborate on the basis of the solution, namely rml. After defining how the data is modeled in Sect. 5, we elaborate on how the mapping is done in Sect. 6. We discuss how the queries of the task are evaluated in Sect. 7. In Sect. 8 we explain the tools used: the rmlprocessor (Sect. 8.1), the rdfunit (Sect. 8.2) and the datatank (Sect. 8.3). Finally, in Sect. 9, we discuss our solution and its results, and present our conclusions.

2 Problem Statement

The conclusions of the Semantic Publishing Challenge 2014 [4] show that the submitted solutions provided satisfying results. However, they also highlight that there is still room for improvement. With the Semantic Publishing Challenge 2015, the organizers continue pursuing the objective of assessing the quality of scientific output, and of evolving the dataset bootstrapped in 2014 to also take into account the wider ecosystem of publications. The challenge consists of the following three tasks:

  • Task 1: Extraction and assessment of workshop proceedings information;

  • Task 2: Extracting contextual information from the papers’ text in pdf; and

  • Task 3: Interlinking.

In this paper we explain how we tackle the first task of the challenge. The participants are asked to extract information from a set of html tables published as Web pages in the ceur-ws workshop proceedings. The information obtained from the html pages’ content is semantically annotated and represented using the rdf data model. The extracted information is expected to answer queries about the quality of these workshops, for instance by measuring growth, longevity, and so on. The task is an extension of spc14’s first task. The most challenging quality indicators from last year’s challenge are reused. However, a number of them are defined more precisely, and new indicators are added. This results in the following three subtasks:

SubTask 1.1: Extract information from the html input pages;

SubTask 1.2: Annotate the information with appropriate ontologies and vocabularies; and

SubTask 1.3: Publish the semantically enriched representation using the rdf data model.

3 Overview of Our Approach

Our approach consists of (i) the generation of the rdf dataset and (ii) the evaluation of the sparql queries. The first is achieved with the following workflow:

  1. define the mapping documents, using rml;

  2. assess the mapping documents, using the rdfunit;

  3. generate the dataset, by executing the mappings with the rmlprocessor;

  4. assess the quality of the dataset, using the rdfunit; and

  5. publish the dataset, using the datatank.

After the generation of the rdf dataset, the queries of the task are evaluated (see Sect. 7). In order to achieve this, the following steps are performed:

  1. define the queries as sparql templates, using the datatank;

  2. instantiate and execute the sparql queries; and

  3. provide the results.

The components and output of our solution and where they can be found are summarized in Table 1.

Table 1. Submission’s output

4 RML

rdf Mapping Language (rml) [2, 3] is a generic language defined to express customized mapping rules from data in heterogeneous formats to the rdf data model. rml is defined as a superset of the w3c-standardized mapping language r2rml [5], extending its applicability and broadening its scope. rml keeps the mapping definitions as in r2rml and follows the same syntax. It provides a generic way of defining mappings that is easily transferable to cover references to other data structures, combined with case-specific extensions; this makes rml highly extensible towards new source formats.

4.1 Structure of an RML Mapping Document

In rml, the mapping to the rdf data model is based on one or more Triples Maps that define how rdf triples should be generated. A Triples Map consists of three main parts: (i) the Logical Source (rml:LogicalSource), (ii) the Subject Map, and (iii) zero or more Predicate Object Maps.

The Subject Map (rr:SubjectMap) defines the rule that generates unique identifiers (uris) for the resources being mapped; the generated identifier is used as the subject of all rdf triples generated from this Triples Map. A Predicate Object Map consists of Predicate Maps, which define the rule that generates the triple’s predicate, and Object Maps or Referencing Object Maps, which define the rule that generates the triple’s object. The Subject Map, the Predicate Map and the Object Map are Term Maps, namely rules that generate an rdf term (an iri, a blank node or a literal).

4.2 Leveraging HTML with RML

A Logical Source (rml:LogicalSource) is used to determine the input source with the data to be mapped. rml deals with different data serializations, which use different ways to refer to their content. Thus, rml considers that any reference to the Logical Source should be defined in a form relevant to the input data, e.g., xpath for xml files or jsonpath for json files. The Reference Formulation (rml:referenceFormulation) indicates the formulation (for instance, a standard or a query language) used to refer to the data. Any reference to the data of the input source must be a valid expression according to the Reference Formulation stated at the Logical Source. This makes rml highly extensible towards new source formats.

In the current version of rml, the ql:CSV, ql:XPath, ql:JSONPath and ql:CSS3 Reference Formulations are predefined (where ql is the prefix for http://semweb.mmlab.be/ns/ql). For the task we use the ql:CSS3 Reference Formulation to access the elements within the document. css3Footnote 6 selectors are standardized by w3c, easy to use and broadly known, as they are used for selecting html elements both for cascading styles and for jQueryFootnote 7. css3 selectors can be used to refer to data in html documents; they can, however, also be used for xml documents.
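As an indication of how such selectors address the input, consider the following simplified and hypothetical fragment of a volume page (the class names follow the ceur-ws markup referenced later in this paper; the concrete values are illustrative):

  <!-- hypothetical, simplified excerpt of a volume page -->
  <span class="CEURVOLNR">Vol-1165</span>
  <span class="CEURVOLTITLE">
    <a href="http://ceur-ws.org/Vol-1165/">Linked Data on the Web</a>
  </span>

Here, the selector span.CEURVOLNR addresses the volume number element, while span.CEURVOLTITLE a addresses the link holding the volume title.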

Fig. 1. An overview of the interaction between the classes and properties used to model the workshop proceedings information.

5 Data Modeling

In order to model the workshop proceedings information, we use the following ontologies:

  • The Bibliographic OntologyFootnote 8 (with prefix bibo);

  • DCMI Metadata TermsFootnote 9 (with prefix dcterms);

  • Friend of a FriendFootnote 10 (with prefix foaf);

  • RDF SchemaFootnote 11 (with prefix rdfs);

  • FRBR-aligned Bibliographic OntologyFootnote 12 (with prefix fabio);

  • The Event OntologyFootnote 13 (with prefix event); and

  • Semantic Web for Research CommunitiesFootnote 14 (with prefix swrc).

The classes used to determine the type of the entities are listed in Table 2.

The properties used to annotate the entities and determine the relationships among them are listed in Table 3. The properties listed there are not exhaustive; for a complete overview of the used properties we refer to the mapping documentsFootnote 15. An overview of the entities and the relationships between the entities and the properties that determine them is shown in Fig. 1. Overall, the modeling of the data is driven by the queries that need to be answered as part of the challenge.

We extracted information related to workshop (bibo:Workshop) entities from the index page. Furthermore, we extracted information that models the relationship among different workshops of the same series (rdfs:seeAlso), that denotes which proceedings were presented at a workshop (bibo:presentedAt), and that states the conference the workshop was co-located with (dcterms:isPartOf). To determine the workshops, we iterated over the volumes because, except for the joint volumes, each of them represents a separate workshop. Finally, the workshops related to the current one are added by following the ‘see also’ links in its description.

Each volume page represents a proceedings entity (bibo:Proceedings). This html page contains information about the papers (swrc:InProceedings, bibo:Document), which are connected to the proceedings (bibo:hasPart). We make a distinction between non-invited and invited papers (using fabio:supplement instead of bibo:hasPart for the latter). The authors (foaf:Person, dcterms:Agent) of each paper are defined using dcterms:creator, and the editors (foaf:Person, dcterms:Agent) of the proceedings using dcterms:editor. Finally, the workshop’s series (bibo:Series) is determined from the workshop’s name, and the workshop’s co-located event (bibo:Conference) is linked using event:sub_event. The extraction of additional information (location, date, edition), annotated with datatype properties, is defined in the mapping documents. Due to the repetitive nature of the corresponding definitions, we refer to the mapping documents for more details.
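Indicatively, the triples generated for a single volume according to this model could look as follows (a minimal sketch; the resource uris and literal values are hypothetical, and only a few of the properties are shown):

  @prefix bibo:    <http://purl.org/ontology/bibo/> .
  @prefix dcterms: <http://purl.org/dc/terms/> .
  @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

  # the proceedings of a volume, linked to a paper and an editor
  <http://ceur-ws.org/Vol-1165/> a bibo:Proceedings ;
      rdfs:label "Linked Data on the Web 2014" ;
      bibo:presentedAt <http://example.com/workshop/1165> ;
      dcterms:editor <http://example.com/person/editor-1> ;
      bibo:hasPart <http://ceur-ws.org/Vol-1165/#paper-1> .

  # the workshop the proceedings were presented at
  <http://example.com/workshop/1165> a bibo:Workshop ;
      dcterms:isPartOf <http://example.com/conference/eswc2014> .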

Table 2. Classes
Table 3. Properties

6 Mapping CEUR-WS from HTML to RDF

The task refers to two types of html pages that serve as input. On the one hand, there is the index page listing all the volumes, namely http://ceur-ws.org. On the other hand, for each volume there is an html page that contains more detailed information, e.g., http://ceur-ws.org/Vol-1165/.

6.1 Defining the Mappings

Excerpts of a mapping document for one of the volumes are indicatively presented. First, we state the input source (Listing 1.1, line 5) that is used by this Triples Map (Listing 1.1, line 4), together with the Reference Formulation (Listing 1.1, line 7), in this case css3 selectors, which states how we refer to the input, and the iterator (Listing 1.1, line 6) over which the iteration occurs, as in Listing 1.1:

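A minimal Turtle sketch of such a Logical Source definition (Listing 1.1); the source url and iterator value are illustrative assumptions, and line numbering may differ from the original listing:

  @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
  @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .

  <#VolumeMapping>
    rml:logicalSource [ rml:source "http://ceur-ws.org/Vol-1165/" ;
      rml:iterator "body" ;
      rml:referenceFormulation ql:CSS3 ] .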

To define how the subject of all rdf triples generated by this Triples Map is formed (Listing 1.2, line 4), we define a Subject Map (Listing 1.2, line 5). A unique uri is generated for each volume, using the volume number present on each page. This number is addressable by the css3 expression span.CEURVOLNR (Listing 1.2, line 6). The class of the workshop is set to swrc:Proceedings (Listing 1.2, line 7). The definition of a complete Subject Map can be found in Listing 1.2:

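A minimal sketch of such a Subject Map (Listing 1.2); the uri template is an illustrative assumption based on the css3 expression above:

  @prefix rr:   <http://www.w3.org/ns/r2rml#> .
  @prefix swrc: <http://swrc.ontoware.org/ontology#> .

  <#VolumeMapping>
    rr:subjectMap [
      rr:template "http://ceur-ws.org/{span.CEURVOLNR}/" ;
      rr:class swrc:Proceedings ] .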

For each rdf triple of the volume we need to define a Predicate Object Map (Listing 1.3, line 7). In our example (see Listing 1.3), we add the predicate for the label (rdfs:label) to the volume (Listing 1.3, line 6). The value of the object is specified as the content of the link (<a>) inside the <span> with the class CEURVOLTITLE, which results in the css3 selector span.CEURVOLTITLE a (Listing 1.3, line 8). The definition of a complete Predicate Object Map is indicatively presented in Listing 1.3:

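A minimal sketch of such a Predicate Object Map (Listing 1.3); line numbering may differ from the original listing references:

  @prefix rr:   <http://www.w3.org/ns/r2rml#> .
  @prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  <#VolumeMapping>
    rr:predicateObjectMap [
      rr:predicate rdfs:label ;
      rr:objectMap [ rml:reference "span.CEURVOLTITLE a" ] ] .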

For the object’s generation, rml is not limited to literals, as in the previous example. A reference to another Triples Map (Listing 1.4, line 8), instead of an rml:reference, is used to generate resources instead of literal values. In Listing 1.4, we state that all subjects of <#EditorMapping> are editors (bibo:editor) of the volume:

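A minimal sketch of such a Referencing Object Map (Listing 1.4), assuming <#EditorMapping> is the Triples Map that generates the editor resources:

  @prefix rr:   <http://www.w3.org/ns/r2rml#> .
  @prefix bibo: <http://purl.org/ontology/bibo/> .

  <#VolumeMapping>
    rr:predicateObjectMap [
      rr:predicate bibo:editor ;
      rr:objectMap [ rr:parentTriplesMap <#EditorMapping> ] ] .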

6.2 Executing the Mappings

Executing an rml mapping requires a mapping document that summarizes all Triples Maps and points to an input data source. The mapping document is executed by an rml processor and the corresponding rdf output is generated. Each Triples Map is processed and the defined Subject Map and Predicate Object Maps are applied to the input data. For each reference to the input html, the css3 extractor returns an extract of the data and the corresponding triples are generated. The resulting rdf can be exported in a user-specified serialization format. This solves subtask 1.3.

Data cleansing is out of rml’s scope. However, the values extracted from the input are not always exactly as desired to be represented in rdf, and the situation is aggravated when mapping, e.g., live html documents on-the-fly, where pre-processing is not possible and css3 expressions alone cannot always be as selective as desired when retrieving extracts from html pages. To this end, we defined and used rml:process, rml:replace and rml:split to further process the values returned from the input source, as defined within a mapping rule. To be more precise, rml:process and rml:replace were used to define regular expressions whenever we needed to be more selective over the returned value and to replace it by a part of the value or another value. For instance, a reference to h3 span.CEURLOCTIME returns Montpellier, France, May 26, 2013 and, since there is no further html annotation, we cannot be more selective over the returned value. In these cases rml:process is used to define a regular expression, e.g. ([a-zA-Z]*), [a-zA-Z]*, [a-zA-Z]* [0-9]*, [0-9]*, and rml:replace is used to define the part of the value that is used for a certain mapping rule, e.g., $1, for the aforementioned case, to map the city Montpellier (see the sketch below). Furthermore, rml:split allows splitting the value based on a delimiter and mapping each part separately. The possibility to chain these extensions enables even more fine-grained selections. These adjustments contribute to solving subtask 1.2.
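Indicatively, such a processing rule could be attached to an Object Map as follows (the predicate and its namespace are hypothetical; the actual properties are defined in the mapping documents):

  @prefix rr:  <http://www.w3.org/ns/r2rml#> .
  @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
  @prefix ex:  <http://example.com/ns#> .  # hypothetical namespace

  <#VolumeMapping>
    rr:predicateObjectMap [
      rr:predicate ex:city ;  # hypothetical predicate
      rr:objectMap [
        rml:reference "h3 span.CEURLOCTIME" ;
        # the reference returns, e.g., "Montpellier, France, May 26, 2013"
        rml:process "([a-zA-Z]*), [a-zA-Z]*, [a-zA-Z]* [0-9]*, [0-9]*" ;
        # keep only the first captured group, i.e. the city
        rml:replace "$1" ] ] .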

Challenge-Specific Adjustments. In order to cope with a number of non-trivial structures of the challenge-specific html input sources, the default css3 selectors are not expressive enough. To this end, we added the css3 function :until(x) to CSSellyFootnote 16, a Java implementation of the w3c css3 specification, used by the rmlprocessor. This function matches the first x elements found in the html document.

The structure of the index page does not allow using the default css3 selectors to extract the required information. However, implementing a custom function is not possible in this case, due to the extensibility limitations of CSSelly. For this reason, we reformattedFootnote 17 the index page to make it processable using the available selectors.

Finally, a number of html pages contain invalid html syntax. To cope with this, we used JTidyFootnote 18 to produce valid versions of the html pagesFootnote 19. These adjustments make it possible to solve subtask 1.1.

7 Query Evaluation

The queries for Task 1 of the challenge can be found at https://github.com/ceurws/lod/wiki/QueriesTask1. Based on the description of each query, we created the corresponding sparql queries based on our data model (Sect. 5). Because of the queries’ templated nature, we defined our queries as sparql templatesFootnote 20 and published them using the datatank (Sect. 8.3), allowing easy access to the queries for different values. For example, the sparql template for query 1.1 can be found in Listing 1.5. It is the same as the original query, with the exception of line 9, where $workshop is added. If we want to execute the query with the value Vol-1085 for the variable workshop, we use the uri http://rml.io/data/spc2015/tdt/queries/q01.json?workshop=Vol-1085. This returns the results of the query in json format.

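A sketch of such a sparql template for query 1.1 (Listing 1.5), which lists the editors of a given workshop’s proceedings; the exact query shape and the use of foaf:name are assumptions based on the data model of Sect. 5:

  PREFIX bibo:    <http://purl.org/ontology/bibo/>
  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX foaf:    <http://xmlns.com/foaf/0.1/>

  SELECT DISTINCT ?name
  WHERE {
    ?proceedings a bibo:Proceedings ;
        dcterms:editor ?editor .
    FILTER (?proceedings = <http://ceur-ws.org/$workshop/>)
    ?editor foaf:name ?name .
  }

When the template is instantiated, $workshop is replaced by the user-supplied value, e.g., Vol-1085, before the query is executed.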

8 Tools

The execution of our publishing workflow relies on two tools: the rmlprocessor, which is used to execute the mapping definitions and generate the rdf dataset, and the rdfunit, which is used to validate and improve the quality of both the defined schema and the generated dataset. Besides the publishing workflow, we used another tool, the datatank, to publish the sparql queries.

8.1 RML Processor

Our rmlprocessorFootnote 21, implemented in Java on top of db2triplesFootnote 22, was used to perform the mappings. The rmlprocessor follows a mapping-driven processing approach: it reads the mapping definitions as defined with rml, and executes the mapping rules to generate the corresponding rdf dataset. The rmlprocessor has a modular architecture in which the extraction and mapping modules are executed independently of each other. When the rml mappings are processed, the mapping module deals with the mappings’ execution as defined in the mapping document in rml syntax, while the extraction module deals with the target language expressions, in our case css3 expressions. To be more precise, the rmlprocessor uses CSSelly, a Java implementation of the w3c css3 specification.

8.2 RDFUnit

rdfunit [6] is an rdf validation framework inspired by test-driven software development. In rdfunit, every vocabulary, ontology, dataset or application can be accompanied by a set of data quality Test Cases (tcs) that ensure a basic level of quality. Assigning tcs to ontologies results in tests that can be reused by datasets sharing the same schema. All tcs are executed as sparql queries using a pattern-based transformation approach. In our workflow, we use rdfunit to ensure that (i) the mapping documents validate against the rml ontology, (ii) the schema, as a combination of several ontologies and vocabularies, is valid and (iii) the generated dataset does not contain violations with respect to the schema used.
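Indicatively, a schema-derived tc boils down to a sparql query that selects the violating resources. The following hypothetical example, formulated against the data model of Sect. 5, flags resources whose dcterms:editor value is not typed as a dcterms:Agent (rdfunit’s actual auto-generated patterns are more elaborate):

  PREFIX dcterms: <http://purl.org/dc/terms/>

  # hypothetical TC: every object of dcterms:editor should be a dcterms:Agent
  SELECT DISTINCT ?resource
  WHERE {
    ?resource dcterms:editor ?editor .
    FILTER NOT EXISTS { ?editor a dcterms:Agent . }
  }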

8.3 The DataTank

the datatankFootnote 23 is a restful data management system written in php and maintained by okfn BelgiumFootnote 24. It enables publishing data in several formats in a Web-readable way. The source data can be stored in text-based files, such as csv, xml and json, or in binary structures, such as shp files and relational databases. the datatank reads the data out of these files and/or structures and publishes them on the Web using a uri as an identifier. It can provide the data in any format, depending on the user’s needs, independently of the original format. Besides publishing data, the datatank allows publishing (templated) sparql queries. sparql templates make it possible for the user to define a variable’s value at runtime. As a result, those queries have improved reusability and their scope fits well with the challenge’s needs.

9 Discussion and Conclusion

It is beneficial that css3 selectors become part of a formalization that performs mappings of data in html. Considering that the rml processor takes care of executing the mappings while the css3 extractor parses the document, the data publishers’ contribution is limited to providing only the mapping document. As rml enables reusing the same mappings over different files, the required effort is even smaller. For the challenge, the same mapping documents and/or definitions were reused for different html input sources.

It is reasonable to consider css3 selectors to extract content from html pages because nowadays most websites use templates formed with css3 selectors. Thus the content of their Web pages is structured in a similar way, which is the same point of reference as the one used by rml. This allows us to use rml mapping documents as a ‘translation layer’ over the published content of html pages.

Furthermore, as the mappings are partitioned in independent Triples Maps, data publishers can select the Triples Maps they want to execute at any time. For instance, in the case of the challenge, if violations were identified using the rdfunit because of incorrect mappings, we could isolate the Triples Map that generated those triples, correct the relevant mapping definitions and re-execute them, without affecting the rest of the mapping definitions or the overall dataset. This becomes even easier considering that the mappings in rml are defined as triples themselves; thus, the triples’ provenance can be tracked and used to identify the mappings and data that cause the erroneous rdf result.

Beyond re-using the same mapping documents, rml allows combining data from different input sources, whether or not they are in the same format. This leads to enhanced results, as the integration of data from different sources occurs during the mapping, and relations between data appearing in different resources can be defined instead of interlinking them afterwards. For instance, the proceedings appearing in html can be mapped in an integrated fashion with the results of the extraction of information from the pdfs of the papers published at the workshops, aligning with the results of Task 2. This results in an enriched dataset when the two original datasets are combined.

Compared to last year’s submission, we made the following improvements: (i) more information was extracted from the index page, while keeping the volume mapping documents simpler; (ii) the information extraction was focused on answering the challenge’s queries; (iii) series and workshops were modeled as separate entities, adding more semantic meaning to the resulting dataset; and (iv) we used single mapping documents for multiple Web pages of the ceur-ws html input sources. These improvements were possible thanks to the updated syntax and the more stable release of the rmlprocessor, leading to a higher number of supported queries.