Linked Data Finland: A 7-star Model and Platform for Publishing and Re-using Linked Datasets
The idea of Linked Data is to aggregate, harmonize, integrate, enrich, and publish data for re-use on the Web in a cost-efficient way using Semantic Web technologies. We address two major hindrances to re-using Linked Data: it is often difficult for a re-user to (1) understand the characteristics of the dataset and (2) evaluate the quality of the data for the intended purpose. This paper introduces the “Linked Data Finland” platform LDF.fi, which addresses these issues. We extend the well-known 5-star model of Tim Berners-Lee with a sixth star for providing the dataset with a schema that explains the dataset, and a seventh star for validating the data against the schema. LDF.fi also automates data publishing and provides data curation tools. The first prototype of the platform is available on the Web as a service, hosting tens of datasets and supporting several applications.
Keywords: Link Data, SPARQL Query, Link Open Data, Metadata Schema, SPARQL Endpoint
1 Publishing Linked Data
Many Linked Data (LD) platforms have emerged on the Web since the publication of the four Linked Data publication principles and the 5-star model. For example, in Life Sciences alone there are LinkedLifeData, NeuroCommons, Chem2Bio2RDF, HCLSIG/LODD, BioLOD, and Bio2RDF.
LDF.fi contributes to the current state of the art of Linked Data publishing as follows: (1) We propose extending the 5-star model into a 7-star model, with the goal of encouraging data publishers to provide their data with explicit metadata schemas and to validate their data for better quality. (2) LDF.fi automates the data publishing process so that not only a SPARQL endpoint but also a rich set of additional data services are generated automatically based on the metadata about the dataset and its graphs. (3) LDF.fi provides end users with additional tools and documentation for publishing, curating, and re-using the datasets. This paper first explains these ideas, and then presents the actual service available online.
2 7-star Linked Data
A major hindrance to re-using a dataset is the difficulty of evaluating how suitable the data is for the application purpose at hand. Datasets often use schemas (vocabularies) for which definitions or descriptions are not available separately, but are embedded in the data itself. This makes it difficult to figure out the characteristics of the data. Furthermore, even when both the data and its schema are given, it may be difficult to say how well the data actually matches the schema; data quality problems are common on the Semantic Web.
The 6th star is given if the schemas (vocabularies) used in the dataset are explicitly described and published alongside the dataset, unless the schemas are already available somewhere on the Web.
For the 7th star, the quality of the dataset against the schemas used in it must be explicated, so that the user can evaluate whether the data quality matches her needs.
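One simple way to make the match between data and schema explicit is to list the properties and classes that appear in the data but are not declared in the schema. The following is only a minimal sketch of that idea, not the validation method used by LDF.fi: triples are modelled as plain Python tuples, and the `ex:` terms, the sample data, and the function name `find_undeclared_terms` are all hypothetical.

```python
# Hypothetical illustration: triples are plain (subject, predicate, object) tuples
# and prefixed names are treated as opaque strings.
RDF_TYPE = "rdf:type"


def find_undeclared_terms(triples, schema_properties, schema_classes):
    """Return (properties, classes) used in the data but missing from the schema."""
    undeclared_properties = set()
    undeclared_classes = set()
    for _subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            # The object of rdf:type is a class; check it against the schema classes.
            if obj not in schema_classes:
                undeclared_classes.add(obj)
        elif predicate not in schema_properties:
            undeclared_properties.add(predicate)
    return sorted(undeclared_properties), sorted(undeclared_classes)


# Assumed sample data with one misspelled property ("ex:autor").
data = [
    ("ex:book1", "rdf:type", "ex:Book"),
    ("ex:book1", "ex:title", "Example title"),
    ("ex:book1", "ex:autor", "ex:person1"),
]
schema_properties = {"ex:title", "ex:author"}
schema_classes = {"ex:Book"}

properties, classes = find_undeclared_terms(data, schema_properties, schema_classes)
print("Undeclared properties:", properties)
print("Undeclared classes:", classes)
```

A report of this kind covers only vocabulary usage; checks on value types or cardinalities would need a richer schema language and are outside this sketch.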
3 Automatic Service Generation
Links for downloading datasets and graphs are provided (if licensing permits it).
Schemas can be downloaded if provided with the data, and links to their documentation are provided (when available).
The following forms are created for inspecting the dataset in more detail: (1) Given a URI, the corresponding RDF description can be read in various formats (Turtle, RDF/XML, RDF/JSON, N3, N-Triples) for human consumption in a browser. The example URI is offered as a first choice to try out. (2) Given a URI, Linked Data browsing can be started from it, with the example URI as a starting point.
There is a SPARQL query form for querying the service with the given query used as a first example.
Links providing Vocab.at analysis reports of the graphs in the dataset are provided. They tell the end-user what schemas (vocabularies) are used in the data, and how they have been used. Issues on data quality are pointed out.
SPARQL Service Descriptions of the datasets are provided, if available. LDF.fi uses the W3C SPARQL Service Description recommendation for this.
Links to visualizations of the data are provided; these may give the re-user more insight into how the dataset can be used in applications.
Licensing conditions of the dataset are provided as well as a label of 1–7 stars.
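The SPARQL query form mentioned above ultimately sends a query to the dataset's endpoint. Under the SPARQL 1.1 Protocol, a query can be passed via HTTP GET in a URL-encoded `query` parameter. The sketch below only builds such a request URL; the endpoint address and the example query are placeholders assumed for illustration, not actual LDF.fi values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder values: not a real LDF.fi endpoint or a query from the portal.
EXAMPLE_ENDPOINT = "http://example.org/dataset/sparql"
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(EXAMPLE_ENDPOINT, EXAMPLE_QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == EXAMPLE_QUERY)
```

The resulting URL could be fetched with any HTTP client; the response format would then depend on the `Accept` header sent with the request.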
4 Data Curation Tools
SeCo Lexical Analysis Services can be used for language recognition, lemmatization, morphological analysis, inflected form generation, and hyphenation.
The ARPA Automatic Text Annotation System can be used for extracting Linked Data from unstructured texts.
The SAHA tool can be used for investigating and editing LDF.fi datasets interactively in real time. In LDF.fi we modified and extended SAHA to work on top of any standard SPARQL endpoint. SAHA is now used as a Linked Data browser in LDF.fi in the same vein as, e.g., URIBurner. Using SAHA as an editor service for a dataset requires permission from the LDF.fi team.
5 The Service
In addition to dataset home pages, the LDF.fi portal includes the following pages available through menu links: Project page describes the underlying national Linked Data Finland initiative; Datasets lists the datasets in the system and links to their home pages; Schemas lists the schemas in the system; Services explains what kind of services LDF.fi provides; Policies documents URI minting and licensing policies in use; Documentation explains dataset documentation features of the portal; Validation explains dataset validation features of the portal; Applications lists application examples of the portal datasets; Your Data? tells how external users can get their data published in LDF.fi.
The first datasets available in LDF.fi include: Finnish DBpedia as a service; various Cultural Heritage datasets including, e.g., BookSampo, whose deployed end-user application has 65,000 monthly users; the history datasets Semantic National Biography (6,300 biographies as Linked Data) and events of World War I (in collaboration with University of Colorado Boulder); Finnish law, published for the first time as Linked Open Data; Aalto University Linked Open Data; two Linked Science datasets about ornithological observations and weather data; various ontologies used by the ONKI Ontology Service; and a linked news dataset. The LDF.fi service is implemented using a combination of the Fuseki SPARQL server for serving primary data, and the Varnish web application accelerator for routing URIs to pertinent applications as well as for content negotiation.
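Content negotiation means choosing a response format from the client's HTTP `Accept` header. In LDF.fi this is handled by Varnish; the Python sketch below only illustrates the general idea and is not the platform's configuration. The table of supported media types, the HTML default, and the function names are assumptions made for the example.

```python
# Assumed set of formats a server might offer for a resource URI.
SUPPORTED = {
    "text/turtle": "Turtle",
    "application/rdf+xml": "RDF/XML",
    "application/n-triples": "N-Triples",
    "text/html": "HTML",
}


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # Default quality when no q parameter is given.
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # Treat malformed q values as "not acceptable".
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_format(accept_header, default="text/html"):
    """Pick the most preferred supported media type, else fall back to default."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return default


print(choose_format("text/turtle;q=0.9, application/rdf+xml"))
print(choose_format("text/html;q=0.5, text/turtle"))
```

Falling back to a default when nothing matches is a simplification; a stricter server could instead answer with HTTP status 406 (Not Acceptable).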
6 Evaluation in a Living Lab Environment
LDF.fi was opened officially in January 2014. The platform is being evaluated by providing the service in an open Living Laboratory environment for data publishers and application developers. References to the first data applications can be found on the applications page of the portal.
1. Alonen, M., Kauppinen, T., Hyvönen, E.: Vocab.at - automatic linked data documentation and vocabulary usage analysis, manuscript (2013). http://www.seco.tkk.fi/publications/submitted/alonen-et-al-vocab.pdf