The individual use cases of LinkedPipes ETL will be demonstrated on the live demo instance (Footnote 5). The aim is to show how LP-ETL helps with the creation and debugging of typical ETL pipelines with Linked Data output. To illustrate the complexity of typical ETL pipelines, consider the 182 datasets published by the COMSODE project team (Footnote 6). For each of these datasets, a UV pipeline was developed. In total, 89 different components were used in these pipelines. These components gather input data, transform it to RDF, align it with various ontologies, link it to third-party datasets, and cleanse the resulting datasets. Each pipeline contained 23 component instances on average. The execution times of the pipelines ranged from several minutes, in the case of transforming small XLS files with codelists to an RDF representation in SKOS, to several days, in the case of transforming the Czech Registry of Addresses to its Linked Data representation, including alignment with various ontologies and linkage to external data sources. This shows that real-world ETL pipelines are complex, cover many processing steps, and may run for a very long time. Therefore, support for a smarter workflow for creating ETL pipelines and for their advanced debugging, as demonstrated by our use cases, is imperative.
4.1 Smarter Workflow for Creating ETL Pipelines
In this use case, we will demonstrate the process of creating a simple ETL pipeline (see Fig. 1): downloading the data sources, transforming them using SPARQL and other techniques, and loading them into a triplestore. This will be demonstrated on the CSV open data of the Czech Trade Inspection Authority (Footnote 7). The development of pipelines in LinkedPipes ETL is supported by a guided workflow. Each time the user adds another component to the pipeline, components are recommended according to several criteria. The first criterion is data unit compatibility: only components that can consume what the preceding component produces are offered. Another criterion is the probability that the component appears after the last component in the pipeline, to which it will be connected; this probability is computed from the existing pipelines in the instance. The last criterion is full-text and tag-based search. Each component has a name and a set of tags assigned to it. As the user types, the offered components are filtered both by their names and by their tags. The usefulness of tags can be demonstrated on the Decompress component, which unpacks ZIP and BZIP2 files. A user searching for a component to unzip input files may type unzip or zip, neither of which matches the Decompress component by name alone. This is where tags help: the Decompress component can carry tags for each supported format (e.g. zip, bzip2) and for the supported actions (unzip, unbzip, decompress, unpack), which increases the chance of finding the component even without knowing its name.
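To illustrate why tag-based search helps, here is a minimal shell sketch over an invented toy catalogue of component names and tags (the file name and all values are assumptions for illustration; real components declare their tags in their definitions):

```
# A toy component catalogue of names and tags (values invented for
# illustration; real components declare their tags in their definitions).
cat > components.csv <<'EOF'
Decompress,zip bzip2 unzip unbzip decompress unpack
Download,http ftp
Transform,sparql rdf
EOF

# A name-only search for "unzip" finds nothing:
cut -d, -f1 components.csv | grep -i unzip

# Searching names and tags together finds Decompress:
grep -i unzip components.csv
```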
4.2 Advanced Debugging Support
When developing a typical ETL pipeline, errors happen. They can be caused, for example, by failing external services or by misconfigured components. In these situations, proper support for debugging is crucial. Imagine an ETL pipeline that runs for a week, or a complex pipeline consisting of a hundred components; both happen in the real world. At any time, the pipeline may fail after already having done substantial work. When this happened in UnifiedViews, there were only two options: the user could either rerun the whole pipeline or search the server for the RDF data.
In LP-ETL, we developed a new debug from functionality, which allows the user, once the problem is fixed, to resume a failed pipeline from an arbitrary point of its previous execution. This saves effort, especially for long-running pipelines, since the user can fix the misconfigured component and continue the execution from that point. In addition, LinkedPipes ETL implements a debug to functionality, which allows the user to run a pipeline from its start only up to a certain point. A typical use case is following the progressive enrichment Linked Data pattern (Footnote 8): first, a complete pipeline with a simple transformation of the input data is developed, including data gathering, loading, and cataloging; later, the data transformation is enhanced, e.g. by improving a SPARQL query in one component, leaving the rest of the pipeline intact. This improved debugging functionality will be demonstrated on the same pipeline as in Subsect. 4.1. The pipeline fails on the RDF to file component due to misconfiguration. We fix the configuration and restart the pipeline from this component. The pipeline then fails again (see Fig. 1), this time on the other RDF to file component, again due to misconfiguration. Components with a dark green border finished in the first execution, components with a light green border finished in the current one, and the red component is the one that failed.
4.3 Full API Coverage and RDF Configuration
LinkedPipes ETL can be integrated into larger platforms and extended with alternative user interfaces; it was designed with this in mind from the beginning. All of its components expose REST APIs and use RDF for the configuration of components and pipelines. The same interfaces are used by LinkedPipes ETL itself, which means they can also be used by any other software. We demonstrate this by showing how a pipeline can be uploaded, executed, and have its state monitored from the command line using curl.
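As a sketch of this interaction, the following curl calls upload a pipeline definition, start its execution, and poll the execution state. The endpoint paths and parameter names are illustrative assumptions, not the documented API; the exact resource URLs are provided by the instance.

```
# NOTE: endpoint paths and parameter names below are illustrative
# assumptions; consult the API of your LP-ETL instance for exact URLs.
ETL=http://localhost:8080

# Upload a pipeline definition (RDF in JSON-LD serialization):
curl -X POST -F "pipeline=@pipeline.jsonld" "$ETL/resources/pipelines"

# Start an execution of the uploaded pipeline (IRI from the previous call):
curl -X POST -F "pipeline=$PIPELINE_IRI" "$ETL/resources/executions"

# Monitor the execution state:
curl -s "$ETL/resources/executions/$EXECUTION_ID"
```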
4.4 Sharing Pipelines and Fragments
LinkedPipes ETL eases the sharing of pipelines and pipeline fragments. Since pipeline definitions are simple RDF files in JSON-LD serialization, they can be shared on the web or on GitHub for reuse. We will demonstrate this on the pipeline from Subsects. 4.1 and 4.2: we will download both the version used to demonstrate the debugging capabilities and the fully working version from the LinkedPipes ETL web documentation.
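A minimal sketch of such reuse, assuming a placeholder URL for the shared definition and the illustrative import endpoint from Subsect. 4.3:

```
# Fetch a shared pipeline definition from the web (placeholder URL):
curl -sL https://example.org/shared/pipeline.jsonld -o pipeline.jsonld

# Import it into a local instance (endpoint path is an illustrative
# assumption, as in Subsect. 4.3):
curl -X POST -F "pipeline=@pipeline.jsonld" http://localhost:8080/resources/pipelines
```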