
1 Introduction

From 12 datasets cataloged in 2007, the Linked Open Data cloud has grown to nearly 1000 datasets containing more than 82 billion triples [4]. Data is being published by both the public and private sectors and covers a diverse set of domains, from life sciences to media and government data. The Linked Open Data cloud is potentially a gold mine for organizations and individuals trying to leverage external data sources in order to produce more informed business decisions [8].

Dataset discovery can be done through public data portals like Datahub.io and publicdata.eu or private ones like quandl.com and enigma.io. Private portals harness manually curated data from various sources and expose it to users either freely or through paid plans. Similarly, in some public data portals, administrators manually review dataset information, validate and correct it, and attach suitable metadata. This metadata mainly takes the form of predefined tags such as media, geography or life sciences, used for organization and clustering purposes. However, the diversity of these datasets makes it hard to classify them into a fixed number of predefined tags, which can be subjectively assigned without capturing the essence and breadth of the dataset [21]. Furthermore, the increasing number of available datasets makes the metadata review and curation process unsustainable, even when outsourced to communities.

There are several Data Management Systems (DMS) that power public data portals. CKAN is the world's leading open-source data portal platform, powering web sites like DataHub, Europe's Public Data and the U.S. Government's open data portal. Modeled on CKAN, DKAN is a standalone Drupal distribution that is used in various public data portals as well. Socrata helps public sector organizations improve data-driven decision making by providing a set of solutions including an open data portal. In addition to these traditional data portals, there is a set of tools that expose data directly as RESTful APIs, such as thedatatank.com.

Metadata provisioning is one of the Linked Data publishing best practices mentioned in [3]. Datasets should contain the metadata needed to effectively understand and use them. This information includes the dataset's license, provenance, context, structure and accessibility. The ability to automatically check this metadata helps in:

  • Delaying data entropy: Information entropy refers to the degradation or loss of information content in raw data or metadata. As a consequence of information entropy, data complexity and dynamicity, the life span of data can be very short. Even when the raw data is properly maintained, it is often rendered useless when the attached metadata is missing, incomplete or unavailable. Comprehensive, high-quality metadata can counteract these factors and increase dataset longevity [20].

  • Enhancing data discovery, exploration and reuse: Users require detailed metadata to accurately interpret and analyze unfamiliar datasets. A study conducted by the European Commission [29] found that both businesses and users face difficulties in discovering, exploring and reusing public data due to missing or inconsistent metadata.

  • Enhancing spam detection: Portals hosting public open data, like Datahub, allow anyone to freely publish datasets. Even with security measures like captchas and anti-spam devices, detecting spam is increasingly difficult. In addition, the growing number of datasets hinders the scalability of this process, affecting the correct and efficient detection of spam datasets.

Data profiling is the process of creating descriptive information and collecting statistics about data. It is a cardinal activity when facing an unfamiliar dataset [24]. Data profiles convey the importance of a dataset without the need for detailed inspection of the raw data. Profiling also helps in assessing a dataset's importance, improving users' ability to search for and reuse parts of the dataset, and detecting irregularities to improve its quality. Data profiling typically includes several tasks:

  • Metadata profiling: Provides general information on the dataset (dataset description, release and update dates), legal information (license information, openness), practical information (access points, data dumps), etc.

  • Statistical profiling: Provides statistical information about data types and patterns in the dataset (e.g. properties distribution, number of entities and RDF triples).

  • Topical profiling: Provides descriptive knowledge on the dataset content and structure. This can take the form of tags and categories used to facilitate search and reuse.

In this work, we address the challenges of automatically validating and generating descriptive dataset profiles. This paper proposes Roomba, an extensible framework consisting of a processing pipeline that combines techniques for data portal identification and dataset crawling with a set of pluggable modules covering several profiling tasks. The framework validates the provided dataset metadata against an aggregated standard set of information. Metadata fields are automatically corrected when possible (e.g. adding a missing license URL reference). Moreover, a report describing all the issues, highlighting those that cannot be automatically fixed, is created and sent by email to the dataset's maintainer. Various statistical and topical profiling tools exist for both relational and Linked Data, and the framework's architecture allows them to be easily added as additional profiling tasks. In this paper, however, we focus on the task of dataset metadata profiling. We validate our framework against a manually created set of profiles and manually check its accuracy by examining the results of running it on various CKAN-based data portals.

The remainder of the paper is structured as follows. In Sect. 2, we review relevant related work. In Sect. 3, we describe our proposed framework's architecture and the components that validate and generate dataset profiles. In Sect. 4, we evaluate the framework, and we finally conclude and outline future work in Sect. 5.

2 Related Work

Data Catalog Vocabulary (DCAT) [25] and the Vocabulary of Interlinked Datasets (VoID) [11] are concerned with metadata about RDF datasets. Several tools aim at exposing dataset metadata using these vocabularies. In [6], the authors generate VoID descriptions limited to a subset of properties that can be automatically deduced from resources within the dataset; it nevertheless provides data consumers with interesting insights. Flemming's Data Quality Assessment Tool provides basic metadata assessment, as it computes data quality scores based on manual user input. The user assigns weights to the predefined quality metrics and answers a series of questions regarding the dataset. These cover, for example, the use of obsolete classes and properties, the number of described entities that are assigned disjoint classes, the usage of stable URIs and whether the publisher provides a mailing list for the dataset. The ODI certificate, on the other hand, provides a description of the published data quality in plain English. It aspires to act as a mark of approval that helps publishers understand how to publish good open data and users how to use it. It gives publishers the ability to provide assurance and support for their data while encouraging further improvements through an ascending scale. ODI takes the form of a free online questionnaire for data publishers focusing on certain characteristics of their data.

Metadata Profiling: The Project Open Data Dashboard tracks and measures how US government web sites implement the Open Data principles, to understand the progress and current status of their public data listings. A validator analyzes machine-readable files, e.g. JSON files, for automated metrics like resolved URLs, HTTP status and content-type. However, deeper schema information about the metadata, like descriptions, license information or tags, is missing. Similarly, on the LOD cloud, the Datahub LOD Validator gives an overview of the Linked Data sources cataloged on the Datahub. It offers step-by-step guidance to check a dataset's completeness level for inclusion in the LOD cloud. The results are divided into four compliance levels, from basic to reviewed and included in the LOD cloud. Although it is an excellent tool to monitor LOD compliance, it lacks the ability to give detailed insights about the completeness of the metadata or an overview of the state of the entire LOD cloud group, and it is very specific to the LOD cloud group rules and regulations.

Statistical Profiling: Calculating statistical information on datasets is vital for applications dealing with query optimization and answering, data cleansing, schema induction and data mining [14, 17, 21]. Semantic sitemaps [10] and RDFStats [22] were among the first approaches to deal with RDF data statistics and summaries. ExpLOD [19] creates statistics on the interlinking between datasets based on owl:sameAs links. In [24], the author introduces a tool that induces the actual schema of the data and gathers corresponding statistics accordingly. LODStats [2] is a stream-based approach that calculates more general dataset statistics. ProLOD++ [1] is a Web-based tool that allows LOD analysis via automatically computed hierarchical clustering [7]. Aether [26] generates VoID statistical descriptions of RDF datasets and provides a Web interface to view and compare VoID descriptions. LODOP [13] is a MapReduce framework to compute, optimize and benchmark dataset profiles; its main target is to optimize the runtime costs of Linked Data profiling. In [18], the authors calculate certain statistical information for the purpose of observing dynamic changes in datasets.

Topical Profiling: Topical and categorical information facilitates dataset search and reuse. Topical profiling focuses on content-wise analysis at the instance and ontological levels. GERBIL [28] is a general entity annotation framework that provides machine-processable output allowing efficient querying. In addition, there exist several entity annotation tools and frameworks [9], but none of these systems is designed specifically for dataset annotation. In [15], the authors created a semantic portal to manually annotate and publish metadata about both LOD and non-RDF datasets. In [21], the authors automatically assigned Freebase domains to extracted instance labels of some of the LOD cloud datasets, with the goal of providing automatic domain identification and thus improving dataset clustering and categorization. In [5], the authors extract dataset topics by exploiting the graph structure and ontological information, thus removing the dependency on textual labels. In [12], the authors generate VoID and VoL descriptions via a processing pipeline that extracts dataset topic models ranked on graphical models of selected DBpedia categories.

Although the above-mentioned tools are able to provide various types of information about a dataset, no existing approach aggregates this information and is extensible enough to combine additional profiling tasks. To the best of our knowledge, this is the first effort towards extensible automatic validation and generation of descriptive dataset profiles.

3 Profiling Data Portals

In this section, we provide an overview of Roomba's architecture and the processing steps for validating and generating dataset profiles. Figure 1 shows the main steps, which are the following: (i) data portal identification; (ii) metadata extraction; (iii) instance and resource extraction; (iv) profile validation; (v) profile and report generation.

Roomba is built as a Command Line Interface (CLI) application using Node.js. Instructions on installing and running the framework are available in its public Github repository. The various steps are explained in detail below.

Fig. 1. Processing pipeline for validating and generating dataset profiles

3.1 Data Portal Identification

Roomba should be extensible to any data portal that exposes its functionalities via an externally accessible API. Since every portal can have its own data model, identifying the software powering a data portal is a vital first step. We rely on several Web scraping techniques in the identification process, which include a combination of the following:

  • URL inspection: Various CKAN-based portals are hosted on subdomains of http://ckan.net, for example CKAN Brazil (http://br.ckan.net). Checking for the existence of certain URL patterns can detect such cases.

  • Meta tags inspection: The <meta> tag provides metadata about the HTML document. It is used to specify page description, keywords, author, etc. Inspecting the content attribute can indicate the type of the data portal. We use CSS selectors to check the existence of these meta tags. An example of a query selector is meta[content*="ckan"] (all meta tags with a content attribute containing the string ckan). This selector can identify CKAN portals, whereas meta[content*="Drupal"] can identify DKAN portals.

  • Document Object Model (DOM) inspection: Similar to the meta tags inspection, we check the existence of certain DOM elements or properties. For example, CKAN powered portals will have DOM elements with class names like ckan-icon or ckan-footer-logo. A CSS selector like .ckan-icon will be able to check if a DOM element with the class name ckan-icon exists. The list of elements and properties to inspect is stored in a separate configurable object for each portal. This allows the addition and removal of elements as deemed necessary.

The identification process for each portal can be easily customized by overriding the default function. Moreover, adding or removing steps from the identification process can be easily configured.

After those preliminary checks, we query one of the portal's API endpoints. For example, DataHub is identified as CKAN, so we query the API endpoint at http://datahub.io/api/action/package_list. A successful request will list the names of the site's datasets, whereas a failing request signals a possible failure of the identification process.
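The inspection steps above can be sketched as follows. This is a simplified illustration in Node.js (the language Roomba is written in): the signature table, the helper name identifyPortal and the DKAN markers are assumptions made for illustration, and the real implementation uses proper CSS selectors over a parsed DOM rather than plain string matching.

```javascript
// Hypothetical portal signatures: URL patterns, meta-tag content and
// characteristic DOM class names, one entry per supported DMS.
const SIGNATURES = {
  ckan: {
    urlPattern: /\.ckan\.net/i,
    metaContent: /ckan/i,
    domMarkers: ['ckan-icon', 'ckan-footer-logo']
  },
  dkan: {
    urlPattern: /dkan/i,
    metaContent: /drupal/i,
    domMarkers: ['dkan-footer']
  }
};

function identifyPortal(url, html) {
  for (const [type, sig] of Object.entries(SIGNATURES)) {
    // 1. URL inspection
    if (sig.urlPattern.test(url)) return type;
    // 2. Meta-tag inspection: look for <meta ... content="...ckan...">
    const metaTags = html.match(/<meta[^>]*content="([^"]*)"/gi) || [];
    if (metaTags.some(tag => sig.metaContent.test(tag))) return type;
    // 3. DOM inspection: check for characteristic class names
    if (sig.domMarkers.some(cls => html.includes(`class="${cls}"`))) return type;
  }
  return null; // unidentified portal
}
```

Each signature entry lives in its own configurable object, so steps can be added or removed per portal, mirroring the customization described above.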

3.2 Metadata Extraction

Data portals expose a set of information about each dataset as metadata. The model used varies across portals. However, a standard model should contain information about the dataset's title, description, maintainer email, update and creation dates, etc. We divide the metadata information into the following types:

General Information: General information about the dataset, e.g., title, description, ID. This information is manually filled in by the dataset owner. In addition, tag and group information is required for classification and for enhancing dataset discoverability. This information can be entered manually or inferred by modules plugged into the topical profiler.

Access Information: Information about accessing and using the dataset. This includes the dataset URL, license information (i.e., license title and URL) and information about the dataset's resources. Each resource has in turn a set of attached metadata, e.g., resource name, URL, format and size.

Ownership Information: Information about the ownership of the dataset, e.g., organization details, maintainer details and author. This information is important to identify the authority to which the generated report and the newly corrected profile will be sent.

Provenance Information: Temporal and historical information on the dataset and its resources, for example creation and update dates and version information. Most of this information can be automatically filled in and tracked.

Building a standard metadata model is out of the scope of this paper; since we focus on CKAN-based portals, we validate the extracted metadata against the CKAN standard model.

After identifying the underlying portal software, we perform iterative queries to the API in order to fetch the datasets' metadata and persist it in a file-based cache system. Depending on the portal software, we can issue specific extraction jobs. For example, in CKAN-based portals, we are able to crawl and extract the metadata of a specific dataset, of all the datasets in a specific group (e.g. the LOD cloud) or of all the datasets in the portal.

3.3 Instance and Resource Extraction

From the extracted metadata we are able to identify all the resources associated with a dataset. They can have various types, like a SPARQL endpoint, API, file or visualization. However, before extracting the resource instance(s), we perform the following steps:

  • Resource metadata validation and enrichment: Check the resource's attached metadata values. Similar to the dataset metadata, each resource should include information about its mimetype, name, description, format, valid dereferenceable URL, size, type and provenance. The validation process issues an HTTP request to the resource and automatically fills in various missing fields when possible, like the mimetype and size, by extracting them from the HTTP response header. However, missing fields that need manual input, like name and description, are marked as missing and will appear in the generated summary report.

  • Format validation: Validate specific resource formats against a linter or a validator, for example node-csv for CSV files and n3 to validate N3 and Turtle RDF serializations.
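The enrichment step for a single resource can be sketched as follows; the field names follow CKAN's resource model, while the function name, the lowercase header keys and the merge logic are illustrative assumptions:

```javascript
// Fill missing resource metadata from an HTTP HEAD response's headers.
// Fields that cannot be derived automatically (name, description) are
// collected in `missing` so they can appear in the summary report.
function enrichResource(resource, headers) {
  const enriched = { ...resource, missing: [] };
  if (!enriched.mimetype && headers['content-type']) {
    // e.g. "text/csv; charset=utf-8" -> "text/csv"
    enriched.mimetype = headers['content-type'].split(';')[0].trim();
  }
  if (!enriched.size && headers['content-length']) {
    enriched.size = parseInt(headers['content-length'], 10);
  }
  for (const field of ['name', 'description']) {
    if (!enriched[field]) enriched.missing.push(field);
  }
  return enriched;
}
```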

Considering that certain datasets contain large numbers of resources, and given the limited computation power of some machines on which the framework might run, a sampler module can be introduced to execute various sampling strategies, which were found to generate accurate results even with a comparably small sample size of 10%. These strategies, introduced in [12], are:

  • Random Sampling: Randomly selects resource instances.

  • Weighted Sampling: Weighs each resource as the ratio of the number of datatype properties used to define the resource over the maximum number of datatype properties over all the dataset's resources.

  • Resource Centrality Sampling: Weighs each resource as the ratio of the number of resource types used to describe a particular resource divided by the total number of resource types in the dataset. This is specific and important to RDF datasets, where important concepts tend to be more structured and linked to other concepts.

However, the sampler is not restricted to these strategies. Strategies like those introduced in [23] can be configured and plugged into the processing pipeline.
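As an illustration, one simple deterministic reading of the weighted sampling strategy might look like the sketch below. The input shape ({ id, propertyCount }) and the top-fraction selection are assumptions made for illustration; an actual implementation may instead sample probabilistically by weight.

```javascript
// Weight each resource by its datatype-property count relative to the
// maximum over all resources, then keep the highest-weighted fraction
// (e.g. fraction = 0.1 for the 10% sample size mentioned above).
function weightedSample(resources, fraction) {
  const max = Math.max(...resources.map(r => r.propertyCount));
  const weighted = resources
    .map(r => ({ ...r, weight: r.propertyCount / max }))
    .sort((a, b) => b.weight - a.weight);
  const k = Math.max(1, Math.round(resources.length * fraction));
  return weighted.slice(0, k);
}
```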

3.4 Profile Validation

A dataset profile should include descriptive information about the data examined. In our framework, we have identified three main categories of profiling information. However, the extensibility of our framework allows additional profiling techniques to be plugged in easily (e.g. a quality profiling module reflecting the dataset quality). In this paper, we focus on the task of metadata profiling.

The metadata validation process identifies missing information and determines whether it can be automatically corrected. Each set of metadata (general, access, ownership and provenance) is validated and corrected automatically when possible. Each profiler task has a set of metadata fields to check against. The validation process checks whether each field is defined and whether the value assigned is valid.

Several special validation steps exist for various fields. For example, email addresses and URLs are validated to ensure that the value entered is syntactically correct. In addition, for URLs, we issue an HTTP HEAD request in order to check whether the URL is reachable. We also use the information contained in a valid content-header response to extract, compare and correct some resource metadata values, like mimetype and size.
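A sketch of the syntactic checks on a single field; the function name and the regular expressions are deliberately simplified approximations, not Roomba's actual validation rules, and URL reachability (the HTTP HEAD request) is left out as a separate step:

```javascript
// Classify a metadata field value as missing, syntactically invalid
// or (syntactically) valid. Reachability is checked separately.
function validateField(field, value) {
  if (value === undefined || value === null || value === '') {
    return { field, error: 'missing' };
  }
  if (field.endsWith('email') && !/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)) {
    return { field, error: 'invalid email' };
  }
  if (field.endsWith('url') && !/^https?:\/\/\S+$/.test(value)) {
    return { field, error: 'invalid url' };
  }
  return { field, error: null };
}
```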

From our experiments, we found that datasets' license information is noisy. License names, when present, are not standardized; for example, Creative Commons CCZero can also appear as CC0 or CCZero. Moreover, the license URI, when present and de-referenceable, can point to different reference knowledge bases, e.g., http://opendefinition.org. To overcome this issue, we have manually created a mapping file standardizing the set of possible license names and the reference knowledge base. In addition, we use the open source and knowledge license information to normalize the license information and add extra metadata like the domain, maintainer and open data conformance.
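The normalization step can be sketched with a small in-memory mapping; the entries below are illustrative, while Roomba's actual mapping file is larger and also records the reference knowledge base and extra metadata such as open data conformance:

```javascript
// Hypothetical excerpt of the manually created mapping that folds
// license-name variants into one standardized title.
const LICENSE_MAP = {
  'cc0': 'Creative Commons CCZero',
  'cczero': 'Creative Commons CCZero',
  'creative commons cczero': 'Creative Commons CCZero',
  'odc-by': 'Open Data Commons Attribution License'
};

// Return the standardized license title, or null when the name is
// unknown (unknown names are reported rather than guessed).
function normalizeLicense(title) {
  if (!title) return null;
  return LICENSE_MAP[title.trim().toLowerCase()] || null;
}
```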


3.5 Profile and Report Generation

The validation process highlights the missing information and presents it in a human-readable report. The report can be automatically sent to the dataset maintainer's email address if it exists in the metadata. In addition to the generated report, the enhanced profiles are represented in JSON using the CKAN data model and are publicly available.

Data portal administrators need overall knowledge of the portal's datasets and their properties. Our framework is able to generate numerous reports on all the datasets by passing formatted queries. There are two main sets of aggregation tasks that can be run:

  • Aggregating meta-field values: Passing a string that corresponds to a valid field in the metadata. The field can be flat, like license_title (aggregating all the license titles used in the portal or in a specific group), or nested, like resource>resource_type (aggregating all the resource types across all the datasets). Such reports are important to get an overview of the possible values used for each metadata field.

  • Aggregating key:object meta-field values: Passing two meta-field values separated by a colon, e.g., resources>resource_type:resources>name. These reports are important as they aggregate the information needed while also printing the set of values associated with it.

For example, running the meta-field value query resource>resource_type against the LODCloud group will result in an array containing values like [file, api, documentation ...]. These are all the resource types used to describe the datasets of the group. However, to also know which datasets contain resources of each type, we issue the key:object meta-field query resource>resource_type:name. The result will be a JSON object with the resource_type as the key and an array of the titles of the corresponding datasets that have a resource of that type.
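Both aggregation modes can be sketched as follows. The helper names are illustrative assumptions, ">" navigates into nested fields as in the paper's query syntax, ":" separates the key and object fields, and the example uses the CKAN field name resources for the nested array:

```javascript
// Resolve a nested field path like "resources>resource_type" against
// one dataset record, returning all matching values as a flat array.
function getValues(dataset, fieldPath) {
  return fieldPath.split('>').reduce((items, key) => {
    return items.flatMap(item => {
      const v = item ? item[key] : undefined;
      return v === undefined ? [] : (Array.isArray(v) ? v : [v]);
    });
  }, [dataset]);
}

// Run either a flat query ("resources>resource_type") or a key:object
// query ("resources>resource_type:name") over cached dataset records.
function aggregate(datasets, query) {
  const [keyPath, objPath] = query.split(':');
  if (!objPath) {
    // Flat aggregation: unique values of one meta-field
    return [...new Set(datasets.flatMap(d => getValues(d, keyPath)))];
  }
  // key:object aggregation: map each key value to the object values
  const out = {};
  for (const d of datasets) {
    for (const key of getValues(d, keyPath)) {
      (out[key] = out[key] || []).push(...getValues(d, objPath));
    }
  }
  return out;
}
```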


4 Experiments and Evaluation

In this section, we present the experiments and evaluation of the proposed framework. All the experiments are reproducible with our tool, and their results are available in its Github repository. A CKAN dataset's metadata describes four main sections in addition to the core dataset properties. These sections are:

  • Resources: The distributable parts containing the actual raw data. They can come in various formats (JSON, XML, RDF, etc.) and can be downloaded or accessed directly (REST API, SPARQL endpoint).

  • Tags: Provide descriptive knowledge on the dataset content and structure. They are used mainly to facilitate search and reuse.

  • Groups: A dataset can belong to one or more groups that share common semantics. A group can be seen as a cluster or a curation of datasets based on shared categories or themes.

  • Organizations: A dataset can belong to one or more organizations controlled by a set of users. Organizations differ from groups in that they are not constructed from shared semantics or properties, but solely from their association with a specific administration party.

Each of these sections contains a set of metadata corresponding to one or more types (general, access, ownership and provenance). For example, a dataset resource has general information such as the resource name, access information such as the resource URL, and provenance information such as the creation date. The framework generates a report aggregating all the problems in all these sections, fixing field values when possible. Errors can be the result of missing metadata fields, undefined field values or field value errors (e.g. an unreachable URL or incorrect email address).

4.1 Experimental Setup

We ran our tool on two CKAN-based data portals. The first one is datahub.io, targeting specifically the LOD cloud group. The current state of the LOD cloud report [27] indicates that the LOD cloud contains 1014 datasets, harvested via an LDSpider crawler [16] seeded with 560 thousand URIs. Roomba, on the other hand, fetches datasets hosted in data portals where datasets have relevant attached metadata. As a result, we relied on the information provided by the Datahub CKAN API. Examining the available tags, we found two candidate groups: the first one, tagged with "lodcloud", returned 259 datasets, while the second one, tagged with "lod", returned only 75 datasets. After manually examining the two lists, we found that the datasets grouped under the tag "lodcloud" are the correct ones. To qualify other CKAN-based portals for the experiments, we used http://dataportals.org/, which contains a comprehensive list of Open Data portals from around the world. In the end, we chose the Amsterdam data portal. The portal was commissioned in 2012 by the Amsterdam Economic Board Open Data Exchange (ODE) and covers a wide range of information domains (energy, economy, education, urban development, etc.) about the Amsterdam metropolitan region.

We ran the Roomba instance and resource extractors in order to cache the metadata files for these datasets locally, and then ran the validation process. The experiments were executed on a machine with a 2.6 GHz Intel Core i7 processor and 16 GB of DDR3 memory. The approximate execution time, alongside a summary of the datasets' properties, is presented in Table 1.

Table 1. Summary of the experiments details

In our evaluation, we focused on two aspects: (i) profiling correctness, which manually assesses the validity of the errors generated in the report, and (ii) profiling completeness, which assesses whether the profilers cover all the errors in the datasets' metadata.

4.2 Profiling Correctness

To measure profile correctness, we need to make sure that the issues reported by Roomba are valid at the dataset, group and portal levels.

At the dataset level, we chose three datasets from both the LOD Cloud and the Amsterdam data portal. The dataset details are shown in Table 2.

Table 2. Datasets chosen for the correctness evaluation

To measure profiling correctness at the group level, we selected four groups from the Amsterdam data portal containing a total of 25 datasets. The choice was made to cover groups in various domains that contain a moderate number of datasets that can be checked manually (between 3 and 9 datasets). Table 3 summarizes the groups chosen for the evaluation.

Table 3. Groups chosen for the correctness evaluation

After running Roomba and examining the results on the selected datasets and groups, we found that our framework provides 100% correct results at both the individual dataset level and the aggregation level over groups. Since our portal-level aggregation is extended from the group aggregation, we can infer that the portal-level aggregation also produces completely correct profiles. However, the lack of a standard way to create and manage collections of datasets was the source of some errors when comparing the results from these two portals. For example, in Datahub, we noticed that all the dataset group information was missing, while in the Amsterdam Open Data portal, all the organization information was missing. Although the error detection is correct, the overlap in the usage of groups and organizations can give a false indication of the metadata quality.

4.3 Profiling Completeness

We analyzed the completeness of our framework by manually constructing a set of profiles that act as a gold standard. These profiles cover the range of problems that can occur in a dataset. These errors are:

  • Incorrect mimetype or size for resources;

  • Invalid number of tags or resources defined;

  • Whether the license information can be normalized via the license_id or the license_title, as well as the normalization result;

  • Syntactically invalid author_email or maintainer_email.

After running our framework on each of these profiles, we measured the completeness and correctness of the results. We found that our framework indeed correctly covers all the metadata problems that can be found in a CKAN standard model.

5 Conclusion and Future Work

In this paper, we proposed a scalable automatic approach for extracting, validating, correcting and generating descriptive linked dataset profiles. The approach applies several techniques in order to check the validity of the provided metadata and to generate descriptive and statistical information for a particular dataset or for an entire data portal. Based on our experiments running the tool on the LOD cloud, we discovered that the general state of the datasets needs attention: most of them lack informative access information and their resources suffer from low availability. These two metrics are of high importance for enterprises looking to integrate and use external linked data.

We have noticed that issues surrounding metadata quality directly affect dataset search, as data portals rely on such information to power their search indexes. We noted the need for tools that are able to identify various issues in this metadata and correct them automatically. We evaluated our framework manually against two prominent data portals and showed that we can automatically scale the validation of dataset metadata profiles completely and correctly.

As part of our future work, we plan to introduce workflows able to correct the rest of the metadata, either automatically or through intuitive manually-driven interfaces. We also plan to integrate statistical and topical profilers in order to generate fully comprehensive profiles, and to propose a ranked standard metadata model that will help generate more accurate and scored metadata quality profiles. Furthermore, we plan to run this tool on various CKAN-based data portals and schedule periodic reports to monitor the evolution of dataset metadata. Finally, we plan to extend this tool to other data portal types like DKAN and Socrata.