
1 Publishing Linked Data

Many Linked Data (LD) platforms have emerged on the Web since the publication of the four Linked Data publication principles and the 5-star model. For example, in the Life Sciences alone there are LinkedLifeData, NeuroCommons, Chem2Bio2RDF, HCLSIG/LODD, BioLOD, and Bio2RDF.

LDF.fi contributes to the current state of the art of Linked Data publishing [2] as follows: (1) We propose extending the 5-star model into a 7-star model, with the goal of encouraging data publishers to provide their data with explicit metadata schemas and to validate their data for better quality. (2) LDF.fi automates the data publishing process so that not only a SPARQL endpoint but also a rich set of additional data services is generated automatically, based on the metadata about the dataset and its graphs. (3) LDF.fi provides end users with additional tools and documentation for publishing, curating, and re-using the datasets. This paper first explains these ideas and then presents the actual service available online.

2 7-star Linked Data

A major hindrance to re-using a dataset is the difficulty of evaluating how suitable the data is for the application purpose at hand. Datasets often use schemas (vocabularies) whose definitions or descriptions are not published anywhere but are only embedded in the data itself. This makes it difficult to figure out the characteristics of the data. Furthermore, given the data and its schema, it may be difficult to say how well the data actually matches the schema; there are many data quality problems on the Semantic Web.

To address these issues, we encourage data publishers with two extra stars:

  • The 6th star is given if the schemas (vocabularies) used in the dataset are explicitly described and published alongside the dataset, unless the schemas are already available somewhere on the Web.

  • For the 7th star, the quality of the dataset with respect to the schemas used in it must be made explicit, so that users can evaluate whether the data quality matches their needs.
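To make the 6th star concrete, the following sketch shows how a dataset's schema could be described explicitly and serialized for publication alongside the data. It is a minimal illustration only: the namespace and the class and property names are hypothetical examples, and rdflib (Python) is used merely as one possible tool.

    # A minimal sketch of the 6th star: describing a dataset's vocabulary
    # explicitly so that it can be published alongside the data. The
    # namespace and the class/property names are hypothetical examples.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/schema/")

    schema = Graph()
    schema.bind("ex", EX)

    # Declare a class used in the dataset and document it for human readers.
    schema.add((EX.Observation, RDF.type, RDFS.Class))
    schema.add((EX.Observation, RDFS.label, Literal("Observation", lang="en")))
    schema.add((EX.Observation, RDFS.comment,
                Literal("A single ornithological field observation.", lang="en")))

    # Declare a property with an explicit domain.
    schema.add((EX.observedSpecies, RDF.type, RDF.Property))
    schema.add((EX.observedSpecies, RDFS.domain, EX.Observation))
    schema.add((EX.observedSpecies, RDFS.comment,
                Literal("Links an observation to the species observed.", lang="en")))

    # Serialize the schema so that it can be published next to the dataset.
    print(schema.serialize(format="turtle"))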

LDF.fi provides supporting tools related to these issues. First, schemas are documented automatically for the human reader by using a schema documentation generator; in our case, the LODE online service is employed. (Other possible tools for schema documentation include SpecGen, Neologism, dowl, Parrot, OWLDoc, and OntologyBrowser.) Second, in order to find out how schemas are actually used in a dataset, we created the new service http://vocab.at [1]. It analyzes a dataset, creates an HTML report that explains vocabulary usage in the data, and reports issues such as undefined properties and unresolvable namespaces. The input for vocab.at is either an RDF file, a SPARQL endpoint, or an HTML page with embedded RDFa markup.
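The following sketch illustrates the kind of check such a report is based on, not vocab.at's actual implementation: given a data graph and its accompanying schema, list the properties used in the data that the schema says nothing about. The file names are hypothetical.

    # A rough illustration of a schema-usage check (not vocab.at's actual
    # implementation): report properties used in the data that the
    # accompanying schema does not define. File names are hypothetical.
    from rdflib import Graph

    data = Graph().parse("dataset.ttl", format="turtle")
    schema = Graph().parse("schema.ttl", format="turtle")

    # Treat a property as "defined" if the schema states anything about it;
    # in practice, terms from well-known vocabularies (rdf:, rdfs:, etc.)
    # would also be whitelisted.
    defined = set(schema.subjects())

    undefined = {p for p in data.predicates() if p not in defined}
    for prop in sorted(undefined):
        print(f"Property used but not defined in the schema: {prop}")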

3 Automatic Service Generation

LDF.fi tries to automate the process of publishing datasets as far as possible in the following way: the publisher is expected to create an RDF dataset with minimal metadata about it and its schemas. Here an extended version of the W3C SPARQL Service Description recommendation and the VoID vocabulary can be used, and the data is stored in the SPARQL endpoint. Alternatively, a simple JSON object listing the dataset and graph names, human-readable labels, and a description of the data can be provided. The metadata may also include an example URI pointing into the dataset, an example SPARQL query for querying the data, and optionally a link to visualizations of the dataset.
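As a hedged illustration of such metadata, the following sketch expresses a minimal VoID description with rdflib; the dataset URI, endpoint address, and example resource are placeholders, not actual LDF.fi resources.

    # A sketch of dataset metadata in the VoID vocabulary. All URIs below
    # are hypothetical placeholders used for illustration only.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, RDF, VOID

    meta = Graph()
    meta.bind("void", VOID)
    meta.bind("dcterms", DCTERMS)

    ds = URIRef("http://example.org/dataset")  # hypothetical dataset URI
    meta.add((ds, RDF.type, VOID.Dataset))
    meta.add((ds, DCTERMS.title, Literal("Example Dataset", lang="en")))
    meta.add((ds, DCTERMS.description,
              Literal("A human-readable description of the data.", lang="en")))
    meta.add((ds, VOID.sparqlEndpoint, URIRef("http://example.org/dataset/sparql")))
    meta.add((ds, VOID.exampleResource, URIRef("http://example.org/dataset/r1")))

    print(meta.serialize(format="turtle"))

Based on such metadata, LDF.fi generates for each dataset a home page on which the following functionalities are available for re-users: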

  1. Links for downloading datasets and graphs are provided (if licensing permits).

  2. Schemas can be downloaded if provided with the data, and links to their documentation are provided (when available).

  3. Forms are created for inspecting the dataset in more detail (see the sketch after this list): (1) given a URI, the corresponding RDF description can be read in various formats (Turtle, RDF/XML, RDF/JSON, N3, N-Triples) for human consumption in a browser, with the example URI as a first choice to try out; (2) given a URI, Linked Data browsing can be started from it, with the example URI as a starting point.

  4. A SPARQL query form is available for querying the service, with the given example query as a first example.

  5. Links to Vocab.at analysis reports of the graphs in the dataset are provided. They tell the end user what schemas (vocabularies) are used in the data and how they have been used; data quality issues are pointed out.

  6. SPARQL Service Descriptions of the datasets are provided, if available; LDF.fi uses the W3C SPARQL Service Description recommendation for this.

  7. Links to visualizations of the data are provided; these may give the re-user more insight into how the dataset can be used in applications.

  8. The licensing conditions of the dataset are provided, as well as a label of 1–7 stars.
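As a usage illustration of items 3 and 4 above, the following sketch dereferences an example URI with content negotiation and then queries the SPARQL endpoint; the URIs and the query are placeholders, not actual LDF.fi resources.

    # A sketch of re-using a published dataset: dereference an example URI
    # with content negotiation (item 3) and query the SPARQL endpoint
    # (item 4). All URIs are hypothetical placeholders.
    import requests
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask for Turtle instead of an HTML page via the Accept header.
    resp = requests.get("http://example.org/dataset/r1",
                        headers={"Accept": "text/turtle"})
    print(resp.text)

    # Fetch a handful of triples from the endpoint.
    endpoint = SPARQLWrapper("http://example.org/dataset/sparql")
    endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])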

4 Data Curation Tools

Data curation refers to the activities and processes of creating, managing, maintaining, and validating data. In LDF.fi, several data curation services are available for analyzing textual data and for creating semantic annotations from it (semi-)automatically:

  1. The SeCo Lexical Analysis Services can be used for language recognition, lemmatization, morphological analysis, inflected-form generation, and hyphenation.

  2. The ARPA Automatic Text Annotation System can be used for extracting Linked Data from unstructured texts.

  3. The SAHA tool can be used for investigating and editing LDF.fi datasets interactively in real time. In LDF.fi, we modified and extended SAHA to work on top of any standard SPARQL endpoint. SAHA is now used as a Linked Data browser in LDF.fi, in the same vein as, e.g., URIBurner. Using SAHA as an editor service for a dataset requires permission from the LDF.fi team.

In our work, we also use some external tools, such as the SILK Framework for linking data.

5 The Service

In addition to the dataset home pages, the LDF.fi portal includes the following pages, available through menu links: the Project page describes the underlying national Linked Data Finland initiative; Datasets lists the datasets in the system and links to their home pages; Schemas lists the schemas in the system; Services explains what kinds of services LDF.fi provides; Policies documents the URI minting and licensing policies in use; Documentation explains the dataset documentation features of the portal; Validation explains the dataset validation features of the portal; Applications lists application examples based on the portal datasets; Your Data? tells how external users can get their data published in LDF.fi.

The first datasets available in LDF.fi include: Finnish DBpedia as a service; various Cultural Heritage datasets including, e.g., BookSampo, whose deployed end-user application has 65,000 monthly users; the history datasets Semantic National Biography (6,300 biographies as Linked Data) and events of World War I (in collaboration with the University of Colorado Boulder); Finnish Law, for the first time as Linked Open Data; Aalto University Linked Open Data; two Linked Science datasets about ornithological observations and weather data; various ontologies used by the ONKI Ontology Service; and a linked news dataset. The LDF.fi service is implemented using a combination of the Fuseki SPARQL server, for serving the primary data, and the Varnish web application accelerator, for routing URIs to the pertinent applications and for content negotiation.
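The routing itself is written in Varnish's configuration language (VCL); the following Python sketch only illustrates the content negotiation idea conceptually, mapping a client's Accept header to a serialization format. The media type table is an assumption for illustration, not LDF.fi's actual configuration.

    # Conceptual illustration of content negotiation (the real logic lives
    # in Varnish's VCL configuration): map the client's Accept header to a
    # serialization format. The media type table is an assumed example.
    FORMATS = {
        "text/turtle": "turtle",
        "application/rdf+xml": "rdf/xml",
        "application/json": "rdf/json",
        "text/html": "html",
    }

    def negotiate(accept_header: str) -> str:
        """Return the first supported format named in the Accept header."""
        for part in accept_header.split(","):
            media_type = part.split(";")[0].strip()  # drop q-values etc.
            if media_type in FORMATS:
                return FORMATS[media_type]
        return "html"  # default for ordinary browsers

    print(negotiate("text/turtle, application/rdf+xml;q=0.8"))  # -> turtle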

6 Evaluation in a Living Lab Environment

LDF.fi was officially opened in January 2014. The platform is being evaluated by providing the service in an open Living Laboratory environment for data publishers and application developers. References to the first data applications can be found on the Applications page of the portal.