Section 3 and Table 1 has already introduced and discussed the size of the new releases. For our experiments, we used the versions listed there and in addition the MARVIN pre-release.
As a variety of methods (e.g.
[7], a pre-cursor of SHACL) has been evaluated on DBpedia before and is not repeated here. We focused this evaluation on the novel Construct Validation, which introduce a whole previously invisible error class. Results are summarized, detailed reports will be linked to the Databus artifacts in the future. For this paper, they are archived here.Footnote 20
Construct Validation Tests. To validate the constructs of the triples produced by DIEF, we specified generic and custom domain-specific test cases. With respect to the constructs in Subsect. 5.1, we provide different test cases for IRI compliance and literal conformity to increase the test coverage over the extracted data. The IRI test cases focus on the encoding or layout of an IRI, and check the correct use of several vocabularies. In case of extracted DBpedia instance IRIs, the test cases validate the correctness considering that a DBpedia resource IRIs should not include sequences of ‘?’, ‘#’, ‘[’, ‘]’, ‘%21’, ‘%24’, ‘%26’, ‘%27’, ‘%28’, ‘%29’, ‘%2A’, ‘%2B’, ‘%2C’, ‘%3B’, ‘%3D’ inside the segment part and follows Wikipedia conventions. The vocabulary test cases, which will be automated later, include tests for these schemas:Footnote 21 dbo, foaf, geo, rdf, rdfs, xsd, itsrdf, and skos to ensure the use of the respective ontology or vocabulary specification. Further, generic IRI and literal test cases are implemented to test their syntactical correctness and to validate the lexical format of typed literals. The full collection of specified custom Construct Validation test cases is versioned at the DIEF git repository.Footnote 22
Construct Validation Metrics. We define Construct Validation Metrics to measure the error rate and the overall test coverage for IRI patterns, encoding errors, datatype formats and vocabularies used in the produced data. The overall construct test coverage is defined by dividing the number of constructs that at least trigger one test by the total amount of found constructs.
The overall error rate (in percent) is determined by dividing the number of constructs that have at least one error by the total number of covered constructs.
Table 3. Custom Construct Validation test statistics of the DBpedia and MARVIN release for the generic and mappings group (Gr). Displaying the total IRI counts, the Construct Validation test coverage of IRIs, and construct errors (e.g. wrong IRI pattern or vocab usage) of certain Databus releases.
Test Results. The custom tests for the DBpedia ‘generic’ and ‘mappings’ release have an average of 87% IRI coverage (cf. Table 3). The test coverage can be increased by writing more custom test cases, but concerning the 80/20 rule, this could result in high efforts and the missing IRI patterns are presumably used inside of homepage or external link relations. The new strict syntax cleaning was introduced on the ‘2019.08.30’ version of the mappings release and later applied to the ‘generic’ release. It removes a significant amount of IRIs from the ‘generic’ version (\(\sim \)500 million) and only a fraction from the ‘mappings’ release, reflecting the different extraction quality of them both. Although strict parsing was used and invalid triples are removed, the other errors remain, which we consider a good indicator that the Construct Validation is complementary to syntax parsing.
Table 4. Construct Validation results of the four test cases: XSD date literal (xdt), RDF language string (lang), DBpedia ontology (dbo) and DBpedia Instance URIs which contain a question mark (dbrq). We mention the total number of triggered constructs (prevalence), the aggregated amount of errors, and the percentile error rate.
Table 4 shows four independent Construct Validation test cases.
XSD Date Literal (xdt). This generic triple test validates the correct format use of xsd:date typed literals (
). Due to the use of strict syntax cleaning, as shown in Table 4, subsequent release later than ‘2016.10.01’ do not contain incorrectly formatted date type literals, loosing several million triples. Removing warnings leads to better interoperability later.
RDF Language String (lang). The DIEF uses particular serialization methods to create triples that are often duplicated and contain deprecated code fragments. The post-processing module had an issue to build correct rdf:langString serializations by adding this IRI as explicit datatype instead of the language tag. Considering the N-Triples specification, this is an implicit literal datatype assigned by their language tags. This bug was not recognized by later parsers (i.e. Apache Jena), because the produced statements are syntactically correct. Therefore, to cover this behavior we introduce a generic test case for this kind of literals. The prevalence of this test is described by the pattern ‘
’ and the test validation is defined by an assertion that the pattern should not exist. Moreover, if a construct can be tested, the test directly fails and so the prevalence of the test is equal to its errors. A post-processing bug fix was provided before the ‘2020.04.01’ release, and considering Table 4 was solved properly.
DBpedia Ontology URIs (dbo). To cover correct use of correct vocabularies, some ontology test cases are specified. For the DBpedia ontology this test is assigned to the ‘http://dbpedia.org/ontology/*’ namespace and checks for correctly used IRIs of the DBpedia ontology. The test demonstrates that the used DBpedia ontology instances used inside the three ‘mappings’ release versions do not conform with the DBpedia ontology (cf. Table 4). By inspecting this in detail, we discovered the intensive production of a non-defined class dbo:Location, which is pending to be fixed. Error rate is lower in later releases, as size increased.
DBpedia Instance URIs (dbrq). This test case checks one encoding criterion of extracted DBpedia resource IRIs. Therefore, if a construct matches ‘
’ the last path segment is checked not to contain the ‘?’ symbol as this kind of IRIs should never carry a query part. As displayed in Table 4, the incorrect extraction of the dbr IRIs considering the ‘?’ symbol occurred for version ‘2019.08.30’ and was then solved in later releases.
Test Coverage of Non-DBpedia Datasets. To show the re-usability of the Construct Validation approach, we analyzed a set of external RDF datasets.Footnote 23 For these datasets our custom test cases achieved an average coverage around 10%. (cf. Table 5). The biggest part is covered by the custom vocabulary tests, especially foaf, rdf, rdfs and skos are commonly used across multiple RDF datasets. Another useful test case represents the correct use of DBpedia IRIs inside these datasets (inbound links). Almost in all external datasets, it could be recognized that backlinked DBpedia instances or ontology IRIs are wrong encoded or incorrectly used. In the case of RDF, this demonstrates that the introduced test approach can validate links between independently produced Linked Open datasets.
Table 5. Custom Construct Validation statistics and triple counts of external RDF NTriples releases on the Databus. Including IRI test coverage and the number of failed tests based on the custom DBpedia Construct Validation.
Limitations. Coverage of Construct Validation. As demonstrated the Construct Validation can test for issues that are not covered by the Syntax or Shape Validation. But for fine-grained testing, to reach a 100% IRI test coverage on an extracted dataset, it is quite hard to define test cases for every used namespace and vocabulary, concerning their encoding and layouts (e.g., external links). Comparison of releases. The number of enabled extractors, produced artifacts, extracted languages, new tests, and mappings can change in newer releases. Therefore, it is challenging to compare evolving releases containing a different set of files and single files that provide more or fewer triples.