1 Introduction

The Linked Open Data (LOD) cloud (footnote 1) has grown significantly in the past years, offering various datasets covering a broad set of domains from life sciences to media and government data [3]. To maintain high-quality data, publishers should comply with a set of best practices detailed in [2]. Metadata provisioning is one of those best practices: it requires publishers to attach the metadata needed to effectively understand and use datasets.

Data portals expose metadata via various models. A model should contain the minimum amount of information that conveys to the inquirer the nature and content of its resources [9]. It should contain information to enable data discovery, exploration and exploitation. We divided the metadata information into the following types:

  • General information: General information about the dataset (e.g. title, description, ID). This general information is manually filled by the dataset owner. In addition, tags and group information are required for classification and for enhancing dataset discoverability.

  • Access information: Information about accessing and using the dataset. This includes the dataset URL, license information (i.e. license title and URL) and information about the dataset's resources. Each resource generally has a set of attached metadata (e.g. resource name, URL, format, size).

  • Ownership information: Information about the ownership of the dataset (e.g. organization details, maintainer details, author). This information is important for identifying the authority to which the generated report and the newly corrected profile will be sent.

  • Provenance information: Temporal and historical information on the dataset and its resources (e.g. creation and update dates, version information, version number). Most of this information can be automatically filled and tracked.

Data portals are the access points for datasets, providing tools to facilitate data publishing, sharing, searching and visualization. CKAN (footnote 2) is the world’s leading open-source data portal platform, powering web sites like the Datahub, which hosts the LOD cloud metadata.

We have created Roomba [1], a tool that automatically validates, corrects and generates dataset metadata for CKAN portals. The datasets are validated against the CKAN standard metadata model (footnote 3). The model describes four main sections in addition to the core dataset’s properties. These sections are:

  • Resources: The actual accessible raw data. They can come in various formats (JSON, XML, RDF, etc.) and can be downloaded or accessed directly (REST API, SPARQL endpoint).

  • Tags: Provide descriptive knowledge on the dataset content and structure.

  • Groups: Used to cluster or curate datasets based on shared themes or semantics.

  • Organizations: Describe datasets solely by their association with a specific administrative party.

The results demonstrate that the general state of the examined datasets needs much more attention, as most of the datasets suffer from poor-quality metadata and lack informative metrics that would facilitate dataset search. The noisiest metadata values were access information, such as licensing information and resource descriptions, in addition to a large number of resource reachability problems. We also show that the automatic corrections of the tool increase the overall quality of the datasets' metadata, and we highlight the need for manual efforts to correct some important missing information.

2 Related Work

The Data Catalog Vocabulary (DCAT) [8] and the Vocabulary of Interlinked Datasets (VoID) [5] are models for representing RDF dataset metadata. Several tools, such as [4], aim at exposing dataset metadata using these vocabularies. Few approaches tackle the issue of examining dataset metadata. The Project Open Data Dashboard (footnote 4) validator analyzes machine-readable files for automated metrics to check their alignment with the Open Data principles. Similarly, on the LOD cloud, the Datahub LOD Validator (footnote 5) checks a dataset's compliance for inclusion in the LOD cloud. However, it lacks the ability to give detailed insights about the completeness of the metadata and an overview of the state of the entire LOD cloud group.

The State of the LOD Cloud Report [7] measures the adoption of Linked Data best practices back in 2011. More recently, the authors in [10] used LDSpider [6] to crawl and analyze 1014 different datasets in the web of Linked Data in 2014. While these reports expose important information about datasets like provenance, licensing and accessibility, they do not cover the entire spectrum of metadata categories as presented in [11].

3 Experiments and Evaluation

In this section, we describe our experiments running the Roomba tool on the LOD cloud. All the experiments are reproducible with our tool and their results are available in its GitHub repository at https://github.com/ahmadassaf/opendata-checker.

3.1 Experimental Setup

The current state of the LOD cloud report [10] indicates that there are more than 1014 datasets available. These datasets have been harvested by the LDSpider crawler [6] seeded with 560 thousand URIs. However, since Roomba requires the datasets' metadata to be hosted in a data portal where either the dataset publisher or the portal administrator can attach relevant metadata to it, we rely on the information provided by the Datahub CKAN API. We consider two possible groups: the first one, tagged with “lodcloud”, returns 259 datasets, while the second one, tagged with “lod”, returns only 75 datasets. We manually inspect these two lists and find that the API result for the tag “lodcloud” is the correct one. The 259 datasets contain a total of 1068 resources. We run the instance and resource extractor from Roomba in order to cache the metadata files for these datasets locally, and we launch the validation process, which takes around two and a half hours on a machine with a 2.6 GHz Intel Core i7 processor and 16 GB of DDR3 memory.
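The dataset selection step described above can be sketched with the CKAN Action API. This is a minimal sketch and not Roomba's own extractor: it assumes the portal exposes the standard package_search action, that the selection is expressed as a tag filter (a group would use a groups: filter instead), and that BASE_URL points at the portal. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the portal; the paper only names "the Datahub CKAN API".
BASE_URL = "https://datahub.io"


def build_search_url(tag, rows=1000, start=0, base_url=BASE_URL):
    """Build a CKAN Action API package_search URL filtered by one tag."""
    query = urlencode({"fq": f"tags:{tag}", "rows": rows, "start": start})
    return f"{base_url}/api/3/action/package_search?{query}"


def summarize(response):
    """Count datasets and resources in a decoded package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    datasets = response["result"]["results"]
    resources = sum(len(d.get("resources", [])) for d in datasets)
    return {"datasets": len(datasets), "resources": resources}


def fetch_tagged(tag):
    """Fetch one page of datasets for a tag (needs network access)."""
    with urlopen(build_search_url(tag)) as reply:
        return json.load(reply)
```

Calling summarize(fetch_tagged("lodcloud")) would give the dataset and resource counts for one result page; larger result sets would need paging through the start parameter.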

3.2 Results and Evaluation

CKAN dataset metadata includes three main sections in addition to the core dataset’s properties. Those are the groups, tags and resources. Each section contains a set of metadata corresponding to one or more metadata types. For example, a dataset resource will have general information such as the resource name, access information such as the resource URL and provenance information such as the creation date. The framework generates a report aggregating all the problems in all these sections, fixing field values when possible. Errors can be the result of missing metadata fields, undefined field values or field value errors (e.g. unreachable URLs or syntactically incorrect email addresses).
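The three error classes named above (missing field, undefined value, field value error) can be illustrated as follows. This is a simplified sketch under stated assumptions, not Roomba's validation code: EXPECTED_FIELDS is a hypothetical excerpt of the model, and the email pattern is a deliberately loose syntactic check.

```python
import re

# Hypothetical excerpt; the real CKAN model defines many more fields.
EXPECTED_FIELDS = ["title", "notes", "url", "author_email"]

# Loose syntactic check: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_dataset(dataset, expected=EXPECTED_FIELDS):
    """Sort each expected field into one of the three error classes."""
    report = {"missing": [], "undefined": [], "invalid": []}
    for field in expected:
        if field not in dataset:
            report["missing"].append(field)      # key absent altogether
        elif dataset[field] in (None, "", [], {}):
            report["undefined"].append(field)    # key present, no value
        elif field.endswith("_email") and not EMAIL_RE.match(dataset[field]):
            report["invalid"].append(field)      # value present but malformed
    return report
```

A reachability check for URL fields would fit in the same place as the email check, but it needs network access and is left out of this sketch.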

Figures 1 and 2 show the percentage of errors found in metadata fields by section and by information type respectively. We observe that the most erroneous information for the dataset core information is related to ownership, since this information is missing or undefined for 41 % of the datasets. Dataset resources have the poorest metadata: 64 % of the general metadata, all the access information and 80 % of the provenance information contain missing or undefined values. Table 1 shows the top metadata field errors for each metadata information type.

Table 1. Top metadata fields error % by information type

We notice that 42.85 % of the top metadata problems shown in Table 1 can be fixed automatically. Among them, 44.44 % of these problems can be fixed by our tool while the others can be fixed by tools that should be plugged into the data portal. We further present and discuss the results grouped by metadata information type in the following sub-sections.

3.3 General Information

34 datasets (13.13 %) do not have valid notes values. The tags information for the datasets is complete except for the vocabulary_id, which is missing from all the datasets’ metadata. All the datasets' groups information is missing display_name, description, title, image_display_url, id and name. After manual examination, we observe a clear overlap between group and organization information. Many datasets, like event-media, use the organization field to show group-related information (being in the LOD Cloud) instead of the publisher's details.

3.4 Access Information

25 % of the datasets' access information (being the dataset URL and any URL defined in its groups) has issues, generally missing or unreachable URLs. 3 datasets (1.15 %) do not have a URL defined (tip, uniprotdatabases, uniprotcitations), while the URLs defined for 45 datasets (17.3 %) were not accessible at the time of writing this paper. One dataset does not have resources information (bio2rdfchebi), while the other datasets have a total of 1068 defined resources.

At the dataset resources level, we notice wrong or inconsistent values in the size and mimetype fields. However, 44 datasets have valid size field values and 54 have valid mimetype field values but were not reachable, thus providing incorrect information. 15 fields (68 %) of all the other access metadata are missing or have undefined values. Looking closely, we notice that most of these problems can easily be fixed automatically by tools plugged into the data portal. For example, the top six missing fields are cache_last_updated, cache_url, urltype, webstore_last_updated, mimetype_inner and hash, which can be computed and filled automatically. However, the most important missing information requiring manual entry is the name and description, which are missing from 817 (76.49 %) and 98 (9.17 %) resources respectively. The URLs of 334 resources (31.27 %) were not reachable, which strongly affects the availability of these datasets. CKAN resources can be of various predefined types (file, file.upload, api, visualization, code and documentation). Roomba also breaks down these unreachable resources according to their types: 211 (63.17 %) resources do not have a valid resource_type, 112 (33.53 %) are files, 8 (2.39 %) are metadata and one (0.029 %) is of the example and documentation types.
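A breakdown of resources by type, like the one reported above, can be computed as in the following sketch. The function name and the "undefined" bucket label are assumptions made for illustration; they are not taken from Roomba.

```python
from collections import Counter


def breakdown_by_type(resources):
    """Map each resource_type to (count, percentage of all resources)."""
    # Resources with an absent or empty resource_type share one bucket.
    counts = Counter(r.get("resource_type") or "undefined" for r in resources)
    total = sum(counts.values())
    return {t: (n, round(100 * n / total, 2)) for t, n in counts.most_common()}
```

Applied to the list of unreachable resources only, this yields the per-type shares of the reachability problems.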

To get more details about the resources' URL types, we created a key:object meta-field values group-level report on the LOD cloud with resources>format:title. This aggregates the resources' format information for each dataset. We observe that only 161 (62.16 %) of the datasets with valid URLs have SPARQL endpoints defined using the api/sparql resource format. 92.27 % provided RDF example links and 56.3 % provided direct links to downloadable RDF dumps.
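One possible reading of the resources>format:title report is sketched below. The interpretation of the specification string is an assumption: the part before the colon is taken as a path into each dataset (a list-valued key, then a field of its items) and the part after the colon as the dataset key used to label the aggregated values. The function name is hypothetical.

```python
def group_report(datasets, spec):
    """Aggregate a nested field per dataset, e.g. 'resources>format:title'."""
    # Assumed syntax: "<list key>><item field>:<label key>".
    path, _, label_key = spec.rpartition(":")
    list_key, _, field = path.partition(">")
    report = {}
    for dataset in datasets:
        values = [item.get(field) for item in dataset.get(list_key, [])]
        # Keep each distinct non-empty value once, in sorted order.
        report[dataset.get(label_key)] = sorted({v for v in values if v})
    return report
```

Counting the datasets whose aggregated list contains api/sparql then gives the share of datasets that declare a SPARQL endpoint.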

The noisiest part of the access metadata is the license information. A total of 43 datasets (16.6 %) do not have defined license_title and license_id fields, while 141 (54.44 %) are missing the license_url field.

Fig. 1. Error % by section

Fig. 2. Error % by information type

3.5 Ownership Information

Ownership information is divided into direct ownership (author and maintainer) and organization information. Four fields (66.66 %) of the direct ownership information are missing or undefined. The breakdown of the missing information is: 55.21 % maintainer_email, 51.35 % maintainer, 15.06 % author_email and 2.32 % author. Moreover, our framework checks the validity of existing email values: 11 (0.05 %) of the defined author_email fields and 6 (0.05 %) of the defined maintainer_email fields are not valid email addresses. For the organization information, two field values (16.6 %) were missing or undefined: 1.16 % of the organization_description and 10.81 % of the organization_image_url information. In addition, two of the defined organization image URLs are unreachable.

3.6 Provenance Information

80 % of the resources' provenance information is missing or undefined. However, most of the provenance information (e.g. metadata_created, metadata_modified) can be computed automatically by tools plugged into the data portal. The only field requiring manual entry is the version field, which was found to be missing in 60.23 % of the datasets.

3.7 Enriched Profiles

Roomba can automatically fix, when possible, the license information (title, url and id) as well as the resources mimetype and size.

20 resources (1.87 %) have incorrect mimetype defined, while 52 resources (4.82 %) have incorrect size values. These values have been automatically fixed based on the values defined in the HTTP response header.
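A correction based on HTTP response headers could look like the sketch below. It assumes the relevant values come from the Content-Type and Content-Length headers of a HEAD request; the paper does not state which headers or request method Roomba uses, and the function names are hypothetical.

```python
from urllib.request import Request, urlopen


def header_values(headers):
    """Extract mimetype and size from a mapping of HTTP response headers."""
    content_type = headers.get("Content-Type")
    length = headers.get("Content-Length")
    return {
        # Drop parameters such as "; charset=utf-8" from the media type.
        "mimetype": content_type.split(";")[0].strip().lower() if content_type else None,
        "size": int(length) if length and length.isdigit() else None,
    }


def correct_resource(resource, headers):
    """Return a copy of the resource with mimetype/size taken from headers."""
    fixed = dict(resource)
    for field, value in header_values(headers).items():
        # Only overwrite when the header is usable and the values differ.
        if value is not None and str(fixed.get(field)) != str(value):
            fixed[field] = value
    return fixed


def fetch_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers (needs network)."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as reply:
        return reply.headers
```

Servers that omit Content-Length or reject HEAD requests leave the stored values untouched in this sketch.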

We have noticed that most of the issues surrounding license information are related to ambiguous entries. To resolve that, we manually created a mapping file (footnote 6) standardizing the set of possible license names and URLs using the open source and knowledge license information (footnote 7). As a result, we managed to normalize the license information of 123 (47.49 %) of the datasets.
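The normalization through a mapping file can be sketched as follows. LICENSE_MAP is a hypothetical two-entry excerpt written for illustration; its aliases are invented, and it does not reproduce the content or layout of the actual mapping file.

```python
# Hypothetical excerpt: entries and aliases are illustrative only.
LICENSE_MAP = [
    {
        "id": "cc-by",
        "title": "Creative Commons Attribution",
        "url": "https://creativecommons.org/licenses/by/4.0/",
        "aliases": ["cc by", "creative commons by", "attribution"],
    },
    {
        "id": "cc-zero",
        "title": "Creative Commons CCZero",
        "url": "https://creativecommons.org/publicdomain/zero/1.0/",
        "aliases": ["cc0", "cc zero", "public domain dedication"],
    },
]


def normalize_license(dataset, mapping=LICENSE_MAP):
    """Replace ambiguous license fields with one standardized entry."""
    keys = ("license_id", "license_title", "license_url")
    seen = {str(dataset.get(k) or "").strip().lower() for k in keys}
    seen.discard("")
    for entry in mapping:
        known = {entry["id"], entry["title"].lower(), entry["url"].lower(), *entry["aliases"]}
        if seen & known:
            return {
                **dataset,
                "license_id": entry["id"],
                "license_title": entry["title"],
                "license_url": entry["url"],
            }
    return dataset  # no known license recognized: leave for manual review
```

Exact matching on lower-cased strings keeps the sketch predictable; entries that match nothing are returned unchanged so that they can be flagged for manual correction.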

To check the impact of the corrected fields, we seeded Roomba with the enriched profiles. Since Roomba uses a file-based cache system, we simply replaced all the datasets' JSON files in the \cache\datahub.io\datasets folder with those generated in \cache\datahub.io\enriched. After running Roomba again on the enriched profiles, we observe that the error percentage for missing size fields decreased by 32.02 % and for mimetype fields by 50.93 %. We also notice that the error percentage for missing license_url fields decreased by 2.32 %.

4 Conclusion and Future Work

In this paper, we presented the results of running Roomba over the LOD cloud group hosted in the Datahub. We discovered that the general state of the examined datasets needs attention, as most of them lack informative access information and their resources suffer from low availability. These two metrics are of high importance for enterprises looking to integrate and use external linked data. We found that the most erroneous information for the dataset core information is ownership related, since this information is missing or undefined for 41 % of the datasets. Dataset resources have the poorest metadata: 64 % of the general metadata, all the access information and 80 % of the provenance information contained missing or undefined values.

We also show that the automatic correction process can effectively enhance the quality of some information. We believe there is a need to have a community effort to manually correct missing important information like ownership information (maintainer, author, and maintainer and author emails). As part of our future work, we plan to run Roomba on various data portals and perform a detailed comparison to check the metadata health of LOD datasets against those in other prominent data portals.