The LOD2 Linked Data stack is a distribution platform for software components that support one or more aspects of the Linked Data life cycle. Each package contains a pre-configured component that, on installation, results in a ready-to-use application. The pre-configuration ensures that the deployed components are able to interact with each other. The system architecture of the deployed Linked Data stack components is explained in Sect. 2.1. The subsequent sections provide more details on the distribution platform and on the requirements software must meet to become part of it. Finally, we provide an overview of the LOD2 stack contents in Sect. 2.5.
2.1 Building a Linked Data Application
The LOD2 stack facilitates the creation of Linked Data applications according to a prototypical system architecture, shown in Fig. 1. From top to bottom, there is first the application layer, which the end user interacts with. The applications are built from components in the component layer. These components communicate with each other and interact with the data in the data layer via the common data access layer.
The data access layer is built around the data representation framework RDF. Data is exchanged in RDF and retrieved with SPARQL queries from SPARQL endpoints. All data format heterogeneity is hidden from the components by this layer. This yields a uniform data view, which eases the configuration of the data flow between the components. The RDF representation offers important advantages that make it well suited as the common data representation formalism: it is a W3C standard, it is domain neutral, and it is web enabled, since all identifiers are web addressable. Last but not least, data integration can start by simply merging the data into one store.
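To illustrate this uniform data view, the following minimal sketch (in Python, using the common requests library) retrieves data from two SPARQL endpoints with the same generic call. The endpoint URLs and the query are illustrative assumptions, not part of the stack itself.

import requests

def sparql_select(endpoint, query):
    # Run a SPARQL SELECT query against any endpoint in the stack and
    # return the result bindings (SPARQL 1.1 Protocol, JSON results).
    response = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

# The same call works for every component's endpoint, regardless of the
# original format of the underlying data (relational, XML, CSV, ...).
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
for endpoint in ("http://localhost:8890/sparql",   # e.g. a local Virtuoso store
                 "http://localhost:2020/sparql"):  # e.g. a D2R server view on a database
    print(sparql_select(endpoint, QUERY))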
Matching the Linked Data life cycle presented in the introduction to the system architecture shows that the extraction and storage tools feature in the data layer, while most of the others are part of the component layer.
An application end user will seldom be directly in touch with the underlying data layer. Instead, end users are offered an application that presents the information in an intuitive, domain-adapted interface. A few of the LOD2 stack browsing and exploration components have been designed for this level. Most of the stack components, however, target the (Linked Data) data manager. These components provide user interfaces that aid the data manager in this task. For instance, the Silk Workbench is a user interface for creating linking specifications. Such a specification can then be used by the Silk engine, which might be embedded in a larger application. Embedding components in this way is the task of the last targeted audience: the Linked Data application developer.
2.2 Becoming a LOD2 Linked Data Stack Component
The system architecture defines minimal requirements for components to become part of the LOD2 stack:
Information is exchanged in RDF format
Business information is requested through SPARQL endpoint access
Updates are provided as SPARQL updates
A typical example is the Silk Workbench: the business data is retrieved by querying SPARQL endpoints, and the result (the link set) can be stored in a SPARQL endpoint that is open to updates. Of course these requirements do not apply to every communication channel of a component. Extraction components such as D2R, for instance, obviously have to communicate with the original source of the information, which is not RDF, but their output is RDF.
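The following minimal sketch illustrates this read/write pattern in Python. It is not taken from the Silk code base; the endpoint URLs, graph name and resources are illustrative assumptions.

import requests

SOURCE_ENDPOINT = "http://localhost:8890/sparql"       # hypothetical source endpoint
UPDATE_ENDPOINT = "http://localhost:8890/sparql-auth"  # hypothetical update-enabled endpoint
LINKS_GRAPH = "http://example.org/graph/links"         # illustrative target graph

# 1. Read business data through a SPARQL endpoint.
query = """
SELECT ?city ?label WHERE {
  ?city a <http://dbpedia.org/ontology/City> ;
        <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 100
"""
rows = requests.get(
    SOURCE_ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
).json()["results"]["bindings"]

# 2. The actual linking logic would compare the rows with a target dataset;
#    here we simply pretend a single link was found.
links = [("http://example.org/resource/Leipzig",
          "http://dbpedia.org/resource/Leipzig")]

# 3. Write the link set back as owl:sameAs triples via SPARQL Update.
triples = " ".join(f"<{s}> <http://www.w3.org/2002/07/owl#sameAs> <{t}> ."
                   for s, t in links)
update = f"INSERT DATA {{ GRAPH <{LINKS_GRAPH}> {{ {triples} }} }}"
requests.post(UPDATE_ENDPOINT, data={"update": update}).raise_for_status()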
A component is distributable via the LOD2 stack repository if it is provided as a Debian package that installs on Ubuntu 12.04 LTS. The Debian packaging system has been chosen because it is a well-established software packaging system used by the popular Linux distribution Ubuntu. We decided on a reference OS release to ensure quality and to reduce the maintenance effort for the components. This choice does not prevent the software from being deployed on other Linux distributions, whether they are Debian based or not: using the tool alien, for example, Debian packages can be installed on a Red Hat distribution.
The above-mentioned requirements ensure a common foundation between all components in the LOD2 stack. For application developers trained in the Linked Data domain, the creation of an information flow is hence always possible. Because each data publishing editorial process is unique and the best solution depends on the data publishing organization's needs, the stack does not aim to be a homogeneous LOD2 Linked Data publishing platform in which all components are hidden behind a single consistent user interface. The goal is, however, to improve the information flow between components. In the first place, this is done by making sure that deployed components have access to each other's output. Additionally, LOD2 has contributed supporting APIs and vocabularies to the component owners. If they extend their components with these, the data flow between the components is further improved.
With this approach, each component retains its individual identity while remaining interoperable with the others.
2.3 LOD2 Stack Repository
In order to guarantee the stability of the available component stack, a multi-stage environment has been set up. Currently there are three stages:
The developers’ area: here the developers put their packages.
The testing stage: this is a collection of LOD2 packages that are subject to integration tests. The goal is to detect problems in the installation of the packages through automated testing.
The stable stage: this is a collection of LOD2 packages that pass the tests.
The LOD2 stack managers are responsible for moving packages from the developers’ area into the testing stage and then to the stable stage.
Orthogonally, we have created two repositories that are OS-release dependent. These contain components that depend on the OS. The typical example is Virtuoso, for which Ubuntu 12.04 64-bit builds are provided. Build and installation instructions are available to support more recent releases and other Linux distributions.
Developers have to contribute a Debian package with the correct LOD2 stack configuration. The rationale behind this approach is to distribute the knowledge of building packages to all partners, but, more importantly, to create awareness of software deployment. When the component owners are responsible for building the Debian packages, they face the imperfections that make their software hard to deploy. This experience has the positive effect that deployment issues are tackled early on. All necessary information for the developers is collected in the contribution documentation (footnote 2).
The LOD2 stack repository distributes software components. In addition to these components, however, there are further components and information sources valuable to the LOD2 stack that are available online. These come in two categories:
software components which are only accessible online for various reasons (special setup, license rules, etc.). Examples are the Sindice search engine and PoolParty.
information sources: for example dbpedia.org and vocabulary.wolterskluwer.de.
2.4 Installing the LOD2 Linked Data Stack
The installation of a local version of the LOD2 stack (footnote 3) is done in a few steps, which are documented online (footnote 4). In short, the following steps must be executed:
Set up an Ubuntu 12.04 LTS 64-bit system.
Download the repository package (of the stage of interest) and install it.
Update the local package cache.
Install the whole LOD2 stack or selected components.
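Once the installation has finished, a quick smoke test can confirm that the central Virtuoso store, through which the stack components exchange data, is up and answering SPARQL requests. The following Python sketch assumes Virtuoso's default local endpoint; adapt the URL if the store has been configured differently.

import requests

# Default SPARQL endpoint of a locally installed Virtuoso store
# (assumption: the standard port 8890; adapt if configured otherwise).
ENDPOINT = "http://localhost:8890/sparql"

response = requests.get(
    ENDPOINT,
    params={"query": "ASK { ?s ?p ?o }"},
    headers={"Accept": "application/sparql-results+json"},
    timeout=10,
)
response.raise_for_status()
print("Store reachable, contains data:", response.json()["boolean"])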
If issues occur during the installation, firstname.lastname@example.org can be contacted for assistance. Background information and frequently occurring situations are documented in the documentation wiki (footnote 5).
2.5 The LOD2 Linked Data Stack Release
In the following paragraphs all LOD2 stack components are summarized. First we list those that are available as Debian packages, followed by those that are available online, and finally we provide a table of online data sources that are of interest and have been supported by the LOD2 project.
2.5.1 Available as Debian Packages
colanut, limes (ULEI). LIMES is a link discovery framework for the Web of Data. It implements time-efficient approaches for large-scale link discovery based on the characteristics of metric spaces. The COLANUT (COmplex Linking in A NUTshell) interface resides on top of LIMES and supports the user during the link specification process.
d2r, d2r-cordis (UMA). D2R Server is a tool for publishing relational databases on the Semantic Web. It enables RDF and HTML browsers to navigate the content of the database, and allows applications to query the database using the SPARQL query language. The d2r-cordis package contains an example setup.
dbpedia-spotlight-gui, dbpedia-spotlight (UMA/ULEI). DBpedia Spotlight aims to shed light on the Web of Documents. It recognizes and disambiguates DBpedia concepts/entities mentioned in natural language texts. There is an online instance that can be used for experimenting.
dl-learner-core, dl-learner-interfaces (ULEI). The DL-Learner software learns concepts in Description Logics (DLs) from examples. Equivalently, it can be used to learn classes in OWL ontologies from selected objects. It extends Inductive Logic Programming to Description Logics and the Semantic Web. The goal of DL-Learner is to provide a DL/OWL-based machine learning tool that solves supervised learning tasks and supports knowledge engineers in constructing knowledge and learning about the data they created.
libjs-rdfauthor (ULEI). RDFauthor is an editing solution for distributed and syndicated structured content on the World Wide Web. The system is able to extract structured information from RDFa-enhanced websites and to create an edit form based on this data.
LODRefine (Zemanta). Open Refine (http://openrefine.org/) extended with RDF (http://refine.deri.ie/) and Zemanta API (http://developer.zemanta.com) extensions.
lod2demo (TenForce). The LOD2 demonstrator is a web application which brings together all LOD2 stack components in one interface. Components are loosely coupled through the Virtuoso store, via which all information is exchanged. It also serves as the top-level meta package that installs the whole stack content on a machine.
lod2-statistical-workbench (TenForce/IMP). A web interface that aggregates several components of the stack, organized in an intuitive way to support the specific business context of a statistical office. The workbench contains several dedicated extensions for the manipulation of RDF data according to the Data Cube vocabulary: validation, merging and slicing of cubes are supported. The workbench has also been used to explore authentication via WebID and the tracking of data manipulations via a provenance trail.
lod2webapi (TenForce). A REST API allowing efficient graph creation and deletion as well as regex-based querying. It also supports central prefix management.
unifiedviews (SWCG). UnifiedViews is a Linked (Open) Data management suite for scheduling and monitoring required tasks (e.g. performing recurring extraction, transformation and load processes) for smooth and efficient Linked (Open) Data management. It supports web-based Linked Open Data portals (LOD platforms) as well as sustainable Enterprise Linked Data integrations inside organisations.
ontowiki, ontowiki-common, ontowiki-mysql, ontowiki-virtuoso, owcli, liberfurt-php (ULEI). OntoWiki is a tool providing support for agile, distributed knowledge engineering scenarios. OntoWiki facilitates the visual presentation of a knowledge base as an information map, with different views on instance data. It enables intuitive authoring of semantic content. It fosters social collaboration by keeping track of changes and by allowing users to comment on and discuss every single part of a knowledge base.
ontowiki-cubeviz (ULEI). CubeViz is a faceted browser for statistical data utilizing the RDF Data Cube vocabulary, which is the state of the art in representing statistical data in RDF. Based on the vocabulary and the encoded Data Cube, CubeViz generates a faceted browsing widget that can be used to interactively filter the observations to be visualized in charts. On the basis of the selected structure, CubeViz offers suitable chart types and options which can be selected by users.
ontowiki-csv-import (ULEI). Statistical data on the web is often published as Excel sheets. Although these have the advantage of being easily readable by humans, they cannot be queried efficiently. They are also difficult to integrate with other datasets, which may be in different formats. To address these issues this component was developed; it focuses on the conversion of multi-dimensional statistical data into RDF using the RDF Data Cube vocabulary.
ore-ui (ULEI). The ORE (Ontology Repair and Enrichment) tool allows knowledge engineers to improve an OWL ontology by fixing inconsistencies and by suggesting further axioms to add to it.
r2r (UMA). R2R is a transformation framework. The R2R mapping API is now included directly into the LOD2 demonstrator application, allowing users to experience the full effect of the R2R semantic mapping language through a graphical user interface.
rdf-dataset-integration (ULEI). This tool allows the creation of Debian packages for RDF datasets. Installing a created package automatically loads the RDF dataset into the Virtuoso store on the system.
sieve (UMA). Sieve is a Linked Data Quality Assessment and Fusion tool. It performs quality assessment and resolves conflicts in a task-specific way according to user configuration.
sigmaee (DERI). Sig.ma EE is an entity search engine and browser for the Web of Data. It is a standalone, deployable, customisable version of Sig.ma. Sig.ma EE is deployed as a web application and performs on-the-fly data integration from both local data sources and remote services (including Sindice.com).
silk (UMA). The Silk Linking Framework supports data publishers in setting explicit RDF links between data items within different data sources. Using the declarative Silk Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfil in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language.
silk-latc (DERI). An improved version of SILK that has been used in the LATC project.
siren (DERI). SIREn is a Lucene/Solr extension for efficient schemaless semi-structured full-text search. SIREn is not a complete application by itself, but rather a code library and API that can easily be used to create a full-featured semi-structured search engine.
sparqled (DERI). SparQLed is an interactive SPARQL editor that provides context-aware recommendations, helping users in formulating complex SPARQL queries across multiple heterogeneous data sources.
sparqlify (ULEI). Sparqlify is a SPARQL-SQL rewriter that enables one to define RDF views on relational databases and query them with SPARQL.
sparqlproxy-php (ULEI). A PHP forward proxy for remote access to SPARQL endpoints; forwards request/response headers and filters out non-SPARQL URL arguments.
stanbol (External - ULEI). Apache Stanbol’s intended use is to extend traditional content management systems with semantic services.
valiant (TenForce). Valiant is a command line tool that automates the extraction of RDF data from XML documents. It is intended for bulk application to a large number of XML documents.
virtuoso-opensource (OGL). Virtuoso is a knowledge store and virtualization platform that transparently integrates Data, Services, and Business Processes across the enterprise. Its product architecture enables it to deliver traditionally distinct server functionality within a single system offering along the following lines: Data Management & Integration (SQL, XML and EII), Application Integration (Web Services & SOA), Process Management & Integration (BPEL), Distributed Collaborative Applications.
virtuoso-vad-bpel, virtuoso-vad-conductor, virtuoso-vad-demo, virtuoso-vad-doc, virtuoso-vad-isparql, virtuoso-vad-ods, virtuoso-vad-rdfmappers, virtuoso-vad-sparqldemo, virtuoso-vad-syncml, virtuoso-vad-tutorial (OGL). Virtuoso Application Distributions (VADs) as Debian packages. These VAD packages extend the functionality of Virtuoso, e.g. with the web-based system administration interface (the Conductor), the interactive SPARQL interface and many more.
EDCAT (TenForce). EDCAT is a service API for a DCAT-based Linked Data catalogue. It provides a JSON view compatible with the W3C DCAT standard. The API is extensible via a plugin architecture. Its main purpose is to serve as an integration layer for establishing dataset catalogues inside organizations.
CKAN (OKF). CKAN is a system for storing, cataloguing and visualising data or other “knowledge” resources. CKAN aims to make it easy to find, share and re-use open content and data, especially in ways that are machine automatable.
lod2stable-repository, lod2testing-repository. These packages activate the repositories in order to get a coherent stack installation.
2.5.2 Available as Online Components
Payola (UEP). Payola is a web application which lets you work with graph data in a new way. You can visualize Linked Data using several preinstalled plugins as graphs, tables, etc. Moreover, you can create an analysis and run it against a set of SPARQL endpoints. Analysis results are processed and visualized using the embedded visualization plugins.
PoolParty Thesaurus Manager (SWC). PoolParty is a thesaurus management system and a SKOS editor for the Semantic Web including text mining and linked data capabilities. The system helps to build and maintain multilingual thesauri providing an easy-to-use interface. PoolParty server provides semantic services to integrate semantic search or recommender systems into systems like CMS, DMS, CRM or Wikis.
PoolParty Extractor (SWC). The PoolParty Extractor (PPX) offers an API providing text mining algorithms based on semantic knowledge models. With the PoolParty Extractor you can analyse documents in an automated fashion, extracting meaningful phrases, named entities, categories or other metadata. Different data or metadata schemas can be mapped to a SKOS thesaurus that is used as a unified semantic knowledge model.
Sindice (DERI, OGL). Sindice is a state-of-the-art infrastructure to process, consolidate and query the Web of Data. Sindice collates billions of pieces of metadata into a coherent umbrella of functionalities and services.
2.5.3 Available Online Data Sources
Table 1 provides an overview of the main data sources to which LOD2 has contributed.
2.5.4 Functional Area Coverage of the LOD2 Stack Components
Distributing the components over the functional areas of the Linked Data publishing cycle yields Fig. 2. In this figure each component is associated with its main role in the data publishing cycle. Placed in the middle are the applications that are not dedicated to one area, such as the automation platform UnifiedViews, and the applications that exploit the LOD2 stack, such as the lod2demo and the LOD2 Statistical Workbench. Online components are marked with a globe icon. The components shown as cubes are dedicated to the statistical domain; they are embedded in the LOD2 Statistical Workbench.
The figure shows that the current version of the stack has a high number of components geared towards the extraction, storage and querying areas. This can be explained by the large number of data formats that need to be transformed to RDF: every component tackles a specific subset of these formats. Most of the other areas contain a small selection of specialized tools for a specific task. In the search/browsing/exploration area, every component has its own way of visualizing the data.