VoIDext: Vocabulary and patterns for enhancing interoperable datasets with virtual links

Semantic heterogeneity remains a problem when interoperating with data from sources of different scopes and knowledge domains. Causes of this challenge are context-specific requirements (i.e. no "one model fits all"), different data modelling decisions, domain-specific purposes, and technical constraints. Moreover, even if the problem of semantic heterogeneity among different RDF publishers and knowledge domains is solved, querying and accessing the data of distributed RDF datasets on the Web is not straightforward. This is because of the complex and tedious process needed to understand how these datasets can be related or linked and, consequently, queried. To address this issue, we propose to extend the existing Vocabulary of Interlinked Datasets (VoID) by introducing new terms, such as the Virtual Link Set concept, and data model patterns. A virtual link is a connection between resources such as literals and IRIs (Internationalized Resource Identifiers) with some commonality, where each of these resources is from a different RDF dataset. These links are required in order to understand how to semantically relate datasets. In addition, we describe several benefits of using virtual links to improve interoperability between heterogeneous and independent datasets. Finally, we exemplify and apply our approach to multiple RDF datasets that are used worldwide.


Introduction
To achieve semantic and data interoperability, several data standards, ontologies, thesauri, controlled vocabularies, and taxonomies have been developed and adopted by both academia and industry. For example, the Industry Foundation Classes [8] is an ISO standard to exchange data among Building Information Modelling software tools [14]. In the life sciences, we can mention the Gene Ontology (GO) among many other ontologies listed in repositories such as BioPortal [30]. Yet, semantic heterogeneity remains a problem when interoperating with data from various sources that represent the same or related information in different ways [13]. This is mainly due to the lack of, or difficulty in reaching, a common consensus, as well as to different modelling decisions, domain scopes and purposes, and constraints (e.g. storage, query performance, legacy and new systems).
Semantic reconciliation, i.e. the process of identifying and resolving semantic conflicts [28], for example by matching concepts from heterogeneous data sources [16], is recognized as a key process to address the semantic heterogeneity problem. To support this process, ontology matching approaches [25] have been proposed, such as YAM++ [23]. Although semantic reconciliation enhances semantic interoperability, it is often not fully applicable or practical (e.g. when no data schema is provided) when considering distributed and independent RDF (Resource Description Framework) datasets of different domain scopes, knowledge domains, and publishers. Indeed, the semantic reconciliation process mostly focuses on aligning concepts and relationships (the "terminological box", TBox) rather than improving interoperability at the data level (the "assertion box", ABox). In addition, even if the semantic reconciliation process among different RDF publishers and knowledge domains is possible and complete, querying and accessing the data of multiple distributed RDF datasets on the Web is not straightforward. This is because of the complex, time-consuming and tedious process of having to understand how the data are structured and how these datasets can be related or linked and, consequently, queried.
The contributions of this paper are as follows:
- To enhance interoperability and to facilitate the understanding of how multiple datasets can be related and queried, we propose to extend and adapt the existing Vocabulary of Interlinked Datasets (VoID) [3]. VoID is an RDF Schema vocabulary used to describe metadata about RDF datasets such as structural metadata, access metadata and links between datasets. However, VoID is limited in the terms and design patterns it offers to model the relationships between datasets in a less verbose, unambiguous and explicit way.
- To overcome this problem, we introduce the concept of virtual link set (VLS). A virtual link is an intersection data point between two RDF datasets. A data point is any node or resource in an RDF graph, such as literals and IRIs (Internationalized Resource Identifiers). An RDF dataset is a set of RDF triples that are published, maintained or aggregated by a single provider [3]. The links are required in order to comprehend how to semantically relate datasets. The major advantage of the VLS concept is that it facilitates the writing of federated SPARQL queries [17] by acting as join points between the federated sources.
- We exemplify and apply VoIDext to various datasets that are used worldwide, and discuss both the theoretical and practical implications of these new concepts with the goal of more easily querying heterogeneous and independent datasets.
This article is structured as follows: Section 2 presents the most important related work. Section 3 details our approach to extend the VoID vocabulary. In Section 4, we describe the major benefits of using VoIDext, and we apply VoIDext to describe VLSs among three bioinformatics RDF data stores that are used worldwide. Finally, we conclude this article with future work and perspectives.

Related Work
Since the release of the SPARQL 1.1 Query Language [17] with federated query support in 2013, numerous federated approaches for data and semantic interoperability have been proposed [18], [9], [32], and [31]. However, to the best of our knowledge, none of them proposes a vocabulary and patterns to extensively, explicitly and formally describe how the data sources can be interlinked beyond "same as"-like mappings, as discussed in Section 3. In effect, existing approaches put the burden on SPARQL users or systems to find out precisely how to write a conjunctive federated query. An emerging research direction entails automatically discovering links between datasets using word embeddings [15]. However, the current focus is mostly on relational or unstructured data [7]. In addition, several link discovery frameworks such as [27], [24], and [19] rely on link specifications to define the conditions necessary for linking resources within datasets. These specifications focus on describing similarity measures (e.g. Jaccard, Cosine, Trigram) as part of the conditions to determine whether two entities should be linked, but they do not consider the data transformations applied during query execution that are often required to link real-world independent and distributed datasets on the Web. These approaches [15], [27], [24], and [19] are complementary to ours because they can aid in the process of defining virtual link sets.
Regarding related RDF-based vocabularies, we can mention the following: VoID, SPARQL 1.1 Service Description (SD), Data Catalog Vocabulary (DCAT) and Simple Knowledge Organization System Reference (SKOS). Although the VoID RDF schema provides the void:Linkset term (Def. 1), this concept alone is not sufficient to precisely and explicitly define virtual links between datasets (discussed in Section 3). By precisely, we mean avoiding multiple ways to represent (i.e. triple patterns) and to interpret interlinks. Moreover, Def. 1, extracted from the VoID specification, impedes the use of the void:Linkset concept to describe a link set between instances of the same class because both are triple subjects stored in different datasets. Finally, vocabularies such as SD, DCAT and SKOS may be used together with VoIDext. Indeed, they are complementary, not mutually exclusive.
Definition 1 (link set, void:Linkset). A collection of RDF links between two datasets. An RDF link is an RDF triple whose subject and object are described in different datasets [3].

Contribution
To mitigate the impediments of interoperating with distributed and independent RDF datasets, we first propose design patterns of how to partially model virtual links with the current VoID vocabulary and expose its drawbacks. To address these drawbacks, we then propose a new vocabulary (i.e. VoIDext) and demonstrate an unambiguous and unique way to extensively and explicitly describe various types of virtual links (see Def. 2). The VoIDext vocabulary is fully described in [12].
Definition 2 (virtual link set). A set of virtual links. A virtual link is a connection between common resources such as literals and instances from two different RDF datasets. Semantic relaxation (see Def. 3) is also considered when identifying common resources between datasets.
Definition 3 (semantic relaxation). It is the capacity of ignoring semantic and data heterogeneities for the sake of interoperability.
In this article, the words vocabulary and ontology are used interchangeably. The methodology applied to develop VoIDext was inspired by the simplified agile methodology for ontology development (SAMOD) [26]. We mainly chose SAMOD because it is designed to quickly develop small and medium-size ontologies and does not require "pair programming"; it usually involves only one ontology engineer. In principle, SAMOD prescribes the involvement of two profiles, namely a domain expert and an ontology engineer. The domain expert is mostly required when developing domain ontologies. In the context of VoIDext, a domain expert was not involved, because it is not a domain ontology. Indeed, the proposed VoID extension is a meta-ontology that describes semantic links between RDF datasets, i.e. virtual links. Further information about how VoIDext was built by applying the SAMOD methodology is given in the Supplementary Material in [11].
Fig. 1 illustrates a complex virtual link about Swiss cantons between the LINDAS dataset (Linked Data Service) of the Swiss Government administration and DBpedia [22]. To define this link, some semantic relaxation is applied. This is because heterogeneities are exacerbated when interoperating independent datasets. For example, what is considered a long name of a Swiss canton in LINDAS is actually a short name in DBpedia. In addition, the data types for the name of the canton are not the same in both datasets, which impedes exact matching when performing a federated join query. Finally, LINDAS contains a few literals with different concatenated translations of the same canton's name, such as "Graubünden / Grigioni / Grischun", that can be matched with the literal "Grisons" asserted as a canton's short name in DBpedia. Indeed, Grisons is the French translation of Graubünden, the German name. Nevertheless, both datasets share literals with some commonality. By exploring this commonality we are able to define a virtual link set between both datasets. Note that the Swiss cantons' resource IRIs in both datasets are not the same; otherwise defining a virtual link set would be simpler, i.e. a simple link set (see Def. 4). In the next subsections, we incrementally demonstrate with a running example how to model a complex link set (see Def. 6) with VoID and VoIDext terms. Tab. 1 shows other datasets and SPARQL endpoints considered in our examples in this article.
Table 1. SPARQL endpoints considered in this article.
Definition 5 (link predicate). According to the VoID specification, a link predicate is the RDF property of the triples in a void:Linkset [3].
Definition 6 (complex link set). It is a complex virtual link set. A complex link set is composed of exactly two link sets xor two shared instance sets (see Def. 7), where xor is the exclusive or.
Definition 7 (shared instance set). A shared instance set between exactly two datasets. For example, two datasets that contain the same OWL/RDFS class instances.

Patterns to model complex link sets with VoID terms
Since our main goal is to facilitate the writing of federated queries, let us suppose that we want to know how to relate Swiss cantons in the LINDAS and DBpedia datasets as shown in Fig. 1. In other words, we want to find out the necessary and sufficient graph pattern in the context of Swiss cantons in each dataset to be able to relate them. Further triple patterns such as attributes (e.g. canton's population, cities, acronym) depend on the specificity of the requested information, which goes beyond the task of joining the two datasets. Let us further assume a SPARQL user without any previous knowledge about these datasets. A possible workflow for this user to find out how to relate LINDAS and DBpedia in terms of Swiss cantons is as follows:
1) The user has to dig up the data schema and documentation, if any, looking for the abstract entity "Swiss canton". This task has to be done for both datasets.
2) If (s)he is lucky, a concept is explicitly defined in the data schema. This is the case for the LINDAS dataset, which contains the class lindas:Canton (prefixes such as lindas: are defined in Tab. 2). Otherwise, the user has to start a tedious search for assertions and terms that can be used for modelling Swiss canton data. This is the situation in DBpedia, where instances are defined as Swiss cantons by assigning dbrc:Cantons_of_Switzerland, an instance of skos:Concept, to the dct:subject property, as in the triple (dbr:Vaud, dct:subject, dbrc:Cantons_of_Switzerland).
3) The user now has to browse the RDF graph, for example by performing additional queries, to be sure that the assertions about the lindas:Canton instances can be used as join points with assertions related to Swiss canton instances in DBpedia. Otherwise, the user has to repeat the previous steps.
4) If data transformations are required because of data and semantic heterogeneities between the datasets, the user has to define resource mappings to be able to effectively perform a federated conjunctive query.
Finally, based on that workflow, the SPARQL user can write the query in Listing 1.1 to establish the virtual links concerning Swiss cantons between both datasets. The link set built by intersecting the resources (i.e. the values of the lindas:longName and dbp:shortName properties) can then be partially modelled with triple patterns based on VoID terms. This enables a second SPARQL user or system to reuse this link set knowledge to write specialized queries over the two datasets starting from the Swiss canton context. In doing so, the second user avoids the tedious, complex and time-consuming task of finding this link set.
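The join that this workflow leads to can be emulated in memory. The following Python snippet is a minimal sketch and not the query of Listing 1.1: the LINDAS instance identifiers, the dbr:Grisons identifier, the `relax` helper and the special-case table are illustrative assumptions; only the two Graubünden literals and dbr:Vaud come from the running example.

```python
# Toy emulation of the virtual link between lindas:longName and dbp:shortName.
# Identifiers and most values are illustrative assumptions.

# (subject, literal value, language tag or None) for lindas:longName
lindas_long_names = [
    ("lindas:canton/GR", "Graubünden / Grigioni / Grischun", None),
    ("lindas:canton/VD", "Vaud", None),
]

# (subject, literal value, language tag) for dbp:shortName
dbpedia_short_names = [
    ("dbr:Grisons", "Grisons", "en"),
    ("dbr:Vaud", "Vaud", "en"),
]

# Hypothetical table for names that still differ once the tag is dropped.
SPECIAL_CASES = {"Grisons": "Graubünden / Grigioni / Grischun"}


def relax(literal, lang):
    """Discard the language tag and apply the special-case table."""
    return SPECIAL_CASES.get(literal, literal)


def virtual_links(left, right):
    """Join both sides on the relaxed literal value."""
    index = {}
    for subject, literal, _lang in left:
        index.setdefault(literal, []).append(subject)
    links = []
    for subject, literal, lang in right:
        for match in index.get(relax(literal, lang), []):
            links.append((match, subject))
    return links


links = virtual_links(lindas_long_names, dbpedia_short_names)
print(links)
```

Without the `relax` step, the Graubünden pair would not be matched, which is the kind of heterogeneity that step 4 of the workflow has to resolve.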
In addition, to the best of our knowledge there is no system capable of precisely establishing this virtual link set automatically because of the complexity and heterogeneities to be solved.
Two graph patterns based on VoID terms, VL_m1 (Fig. 3) and VL_m2 (Fig. 4), can be used to model this complex link set. On the one hand, VL_m1 states that a given link set LS_1 targets another link set LS_2 that targets LS_1 back. On the other hand, VL_m2 only states datasets as link set targets. By using the VL_m1 model in Fig. 3, the DS_2 instance asserts the DS_1 instance to its void:objectsTarget property. As a reminder, the void:objectsTarget value is the dataset describing the objects of the triples contained in the link set, in our example, the objects of dbp:shortName. This dataset must contain only the relevant triples to describe the virtual link set. In our example in Fig. 3, we define the DS_1 dataset (i.e. a subset of LINDAS) as being also a void:Linkset that contains triples with the lindas:longName predicate. By using the VL_m2 model in Fig. 4, the objects' target dataset of the dbp:shortName link predicate is not a void:Linkset but a void:Dataset (i.e. the superclass of void:Linkset). DS_1 in Fig. 4 also contains triples with the lindas:longName predicate; however, this predicate is defined as part of a subset and partition of DS_1 by using the void:propertyPartition and void:property terms. Note that only one void:propertyPartition should be directly assigned to the DS_1 dataset, otherwise we are not able to know which predicate should be considered when stating the virtual links.
Yet, we also need to describe further information about the virtual link set, such as the domain and range of the link predicates (e.g. lindas:longName and dbp:shortName). This information is used to restrict which resource type must be considered for a given triple that contains the link predicate (e.g. lindas:longName rdfs:domain lindas:Canton). By having this information in advance when writing and executing a federated query, we reduce the number of triples to match if there are statements of the same predicate but with resources of other types. For example, the lindas:longName property is asserted to instances of lindas:MunicipalityVersion, lindas:Canton or lindas:DistrictEntityVersion. However, in the context of this virtual link set only lindas:Canton instances need to be considered. To restrict the resource types of a given link predicate with VoID, we can state subsets and partitions to a void:Linkset, as illustrated in Fig. 3 by using VL_m1.
Note that for each link predicate's domain/range, we have to create one new subset to be sure that the class partitions of the subset correspond to the domain or range of the link predicate. In addition, if there are multiple resource types to be considered as the domain of a link predicate, we can state multiple class partitions to express the union of types, i.e. classes. Alternatively, we can explicitly define it by using the OWL 2 Description Logic (DL) term owl:unionOf and related patterns to express class union. To express class intersection or other class expressions, we can rely on OWL 2 DL terms and state these class expressions as class partitions of the subset. Similarly, we can model the domain and range of predicates related to virtual links with VL_m2 as illustrated in Fig. 4. For the sake of simplicity, we do not depict all predicate domains/ranges in Figures 3 and 4.
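To illustrate why restricting the domain of a link predicate reduces the number of triples to match, the sketch below filters toy lindas:longName triples by subject type. The instance identifiers and literal values are invented for illustration, and a union of classes is represented simply as a Python set of class names.

```python
# Toy triples (subject, predicate, object); instance IRIs and values are made up.
triples = [
    ("ex:c1", "rdf:type", "lindas:Canton"),
    ("ex:c1", "lindas:longName", "Vaud"),
    ("ex:m1", "rdf:type", "lindas:MunicipalityVersion"),
    ("ex:m1", "lindas:longName", "Lausanne"),
    ("ex:d1", "rdf:type", "lindas:DistrictEntityVersion"),
    ("ex:d1", "lindas:longName", "District de Lausanne"),
]


def objects_in_domain(triples, predicate, domain_classes):
    """Return the objects of `predicate` whose subject is typed with a domain class."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o in domain_classes}
    return [o for s, p, o in triples if p == predicate and s in typed]


# Only canton names remain as candidate join values.
canton_names = objects_in_domain(triples, "lindas:longName", {"lindas:Canton"})
print(canton_names)
```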
However, there are several limitations when only considering VoID terms to model complex link sets.
1) Multiple representations. Because of its lack of constraints and its high level of generalization, the VoID vocabulary and documentation allow several ways to model virtual link sets, such as the VL_m1 and VL_m2 graph patterns to represent a complex link set. In addition, there are various ways to define the link predicate's domain and range. For example, class expressions can be defined by using either OWL 2 DL terms to express the union of classes or multiple class partitions (i.e. void:classPartition assertions), or by combining both of them. This multitude of graph patterns allowed by VoID makes interoperability more complex because we do not know in advance how the virtual link set is modelled. Consequently, complex parsers and queries must be built to retrieve the virtual link set metadata.
2) Ambiguity. With VoID, we cannot easily distinguish whether a link set or dataset is being instantiated to define a virtual link set. For example, we do not know explicitly if two link sets compose a complex link set. Moreover, the use of property/class partitions to define domains and ranges of link predicates can be mixed with void:class assertions that are not part of a domain/range definition. Subsets can also be arbitrarily stated for any link set or dataset, which makes it harder to know whether a given subset is actually part of a complex link set definition or not. With the VL_m1 and VL_m2 models strictly based on VoID, we cannot explicitly state that the intersections between two link sets occur by matching the subjects-objects, objects-objects or subjects-subjects of link predicates in different link sets. Nevertheless, this information can be derived from the void:objectsTarget and void:subjectsTarget assertions, if any.
3) Description Logic (DL) compliance [5]. By stating the domain and range of link predicates with class expressions based on OWL 2 DL (e.g. a range composed of multiple types/classes), we can take advantage of existing DL parser and reasoner tools to infer instance types. However, since we can mix DL-based class expressions with void:class assertions, the resulting range/domain expressions are non-compliant with DL.
4) Verbosity. The use of class and property partition patterns considerably increases the number of triples to state for representing virtual links. This also increases the complexity of writing queries to retrieve the virtual link set metadata.
5) Resource mapping. VoID does not provide any explicit term or recommendation to state resource mappings. Stating such mappings would mitigate or even solve heterogeneities when matching resources with some commonality in different datasets.
In the next subsection, we show how to solve these issues with VoIDext terms and patterns.

Patterns to model complex link sets with VoIDext
To address the issues of modelling virtual link sets solely with VoID, we propose new terms and patterns in VoIDext. Fig. 5 illustrates the main VoIDext terms (see terms with the voidext: prefix) and design patterns to model complex link sets. To assert the range and domain of predicates with VoIDext, we can directly assign the voidext:linkPredicateRange and voidext:linkPredicateDomain properties to a link set, respectively (see Def. 8 and Def. 9). Complex link predicates' domains and ranges (e.g. multiple types, i.e. union/intersection of classes) must be stated as class expressions by using OWL 2 DL terms (e.g. owl:unionOf). To avoid ambiguities when interpreting link sets (i.e. a simple set versus a complex one), we can explicitly state that two link sets are indeed part of a complex link set. To do so, we must assign exactly two link sets to a complex link set with the voidext:intersectAt property (see Def. 10). In a complex link set, a link set must be connected to another link set by stating either the void:objectsTarget or the void:subjectsTarget property. This allows us to know precisely where the intersection between RDF triples with predicates in different datasets occurs, in other words, which RDF resource nodes are matched: object-object, subject-subject, or subject-object. For example, in Fig. 5, with the void:objectsTarget property, we state that the lindas:longName predicate's objects in LINDAS match the objects of the dbp:shortName link predicate in DBpedia, and vice versa. To explicitly state the intersection type (e.g. object-object), we can assert the voidext:intersectionType property (see Def. 11) to a complex link set as shown in Fig. 5.
Definition 8 (link predicate range). The link predicate's object type (i.e. class expression), if any. Moreover, a link set (Def. 1) that is not part of a complex link set (see Def. 6) and connects two datasets through the link predicate's object must specify the link predicate range. Indeed, this object matches a second resource in another dataset. Therefore, the type of this second resource is asserted as the link predicate range.
Definition 9 (link predicate domain). The link predicate's subject type (i.e. class expression), if any.
Definition 10 (intersects at). It specifies the intersection of either exactly two shared instance sets (see Def. 7) or two link sets, that compose a complex link set.
Definition 11 (intersection type). It specifies the intersection type between two RDF triples in different datasets. In other words, it states whether the intersection occurs at the subject xor the object node of a link predicate.
Based on Def. 6, the voidext:ComplexLinkSet OWL class is defined with the following DL expression (IRI prefixes are omitted to improve readability):
As a reminder, a void:Dataset is a set of RDF triples from a single provider. However, a complex link set is composed of RDF triples from different providers (e.g. two link predicates). Therefore, we define voidext:ComplexLinkSet as being disjoint with the void:Dataset class. Consequently, a complex link set is not a void:Dataset, and properties such as void:propertyPartition cannot be assigned to it.
To address data heterogeneities, we can implement semantic relaxation by stating the voidext:resourceMapping property (Def. 12) with a literal text based on the SPARQL language. In Fig. 5, DS_2 states a mapping in line 5 that converts dbp:shortName language-tagged string values into simple literals and maps the values to the corresponding ones in the LINDAS dataset. Since this mapping is defined using the SPARQL language, it can be directly used to build a SPARQL 1.1 federated query to perform the interlinks between datasets. In Fig. 5, voidext:recommendedMapping (Def. 13) assigns the LINDAS DS_1 link set as the one containing the mapping to be considered when interlinking with DBpedia in the context of Swiss cantons.
Definition 12 (resource mapping). It specifies the mapping function (f_m) to preprocess a resource (i.e. IRI or literal) in a source dataset in order to match another resource in the target dataset. The resource preprocessing (i.e. mapping) must be defined with the SPARQL language by mainly using SPARQL built-ins for assignments (e.g. BIND), and expression and testing values (e.g. IF and FILTER). The BIND built-in is used to assign the output of f_m, if any.
Definition 13 (recommended resource mapping). It specifies one recommended mapping function, if more than one mapping is defined in the different sets that are part of a complex link set.
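Because a resource mapping is plain SPARQL text, it can be spliced into a federated query. The following Python sketch assembles such a query from a link set description. It is not the authors' query template: the function name, the endpoint URLs, the lindas: namespace IRI, the variable names and the mapping text are placeholders chosen for illustration.

```python
def build_federated_query(prefixes, left, right, mapping):
    """Assemble a SPARQL 1.1 federated query that joins two link predicates.

    `mapping` is SPARQL text expected to bind ?match from ?raw.
    """
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?left ?right WHERE {\n"
        f"  SERVICE <{left['endpoint']}> {{\n"
        f"    ?left {left['predicate']} ?match .\n"
        "  }\n"
        f"  SERVICE <{right['endpoint']}> {{\n"
        f"    ?right {right['predicate']} ?raw .\n"
        f"    {mapping}\n"
        "  }\n"
        "}"
    )


query = build_federated_query(
    prefixes={
        "lindas": "https://example.org/lindas#",  # placeholder namespace
        "dbp": "http://dbpedia.org/property/",
    },
    left={
        "endpoint": "https://example.org/lindas/sparql",  # placeholder URL
        "predicate": "lindas:longName",
    },
    right={
        "endpoint": "https://example.org/dbpedia/sparql",  # placeholder URL
        "predicate": "dbp:shortName",
    },
    # Assumed mapping: STR() turns a language-tagged string into a simple literal.
    mapping="BIND(STR(?raw) AS ?match)",
)
print(query)
```

Keeping the mapping as a text fragment mirrors Def. 12, where the preprocessing is expressed with SPARQL built-ins such as BIND, IF and FILTER.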
To exemplify a complex link set composed of shared instance sets (Def. 7), let us consider the UniProt and EBI RDF datasets (see Tab. 1). The EBI and UniProt RDF data stores use different instance IRIs and classes to represent organism species and, more generally, the taxonomic lineage of organisms. For example, let us consider the <http://identifiers.org/taxonomy/9606> instance of biopax:BioSource and the <http://purl.uniprot.org/taxonomy/9606> instance of up:Taxon in the EBI and UniProt datasets, respectively. Although these instances are not exactly the same (i.e. distinct IRIs, property sets, and contexts), they refer to the same organism species to some extent, namely Homo sapiens (human). By applying a semantic relaxation, we can state a virtual link between these two instances. To establish this link, we need to define a resource mapping function (i.e. f_m(r)) for either the EBI or the UniProt species-related instances: either f_m(<http://identifiers.org/taxonomy/9606>) ≡ <http://purl.uniprot.org/taxonomy/9606> or f_m(<http://purl.uniprot.org/taxonomy/9606>) ≡ <http://identifiers.org/taxonomy/9606>. Fig. 6 depicts how this complex link set is modelled with VoIDext-based patterns. Note that it is not possible to define a shared instance set by only using VoID terms because there is no link predicate (Def. 5) associated with the interlinks other than rdf:type (see Fig. 7). To address this issue, we can assign a shared instance type (Def. 14) with the voidext:sharedInstanceType property for each voidext:SharedInstanceSet instance (Def. 7). Other examples of complex link sets are available in [11] and [12].
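The mapping function f_m for taxonomy instances amounts to rewriting the IRI namespace. The sketch below shows one direction of it in Python; the function and constant names are illustrative assumptions, while the two namespaces are those of the example IRIs above.

```python
EBI_TAXON_NS = "http://identifiers.org/taxonomy/"
UNIPROT_TAXON_NS = "http://purl.uniprot.org/taxonomy/"


def map_ebi_to_uniprot(iri):
    """Return the UniProt taxon IRI for an EBI taxon IRI, or None for other IRIs."""
    if not iri.startswith(EBI_TAXON_NS):
        return None
    return UNIPROT_TAXON_NS + iri[len(EBI_TAXON_NS):]


mapped = map_ebi_to_uniprot("http://identifiers.org/taxonomy/9606")
print(mapped)
```

In a VoIDext description the same rewriting would be stated as SPARQL text (Def. 12), for instance with a BIND over the string form of the IRI; the exact expression used by the authors is not shown here.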
Definition 14 (shared instance type). The type (i.e. class) of the shared instances in a given dataset. Shared instances imply equivalent or similar instance IRIs that belong to different datasets.

Patterns to model simple link sets with VoIDext
Fig. 8 shows the representation of a simple link set based on owl:sameAs assertions for musical artists, including bands, from the MusicBrainz dataset [20] to DBpedia. In Fig. 8, we can easily compare the representation of this simple link set using only VoID terms (left-hand side) with a modelling based on VoIDext (right-hand side). With VoIDext, we explicitly state that this owl:sameAs link set is also a voidext:SimpleLinkSet (see line 2 in Fig. 8, right-hand side). Besides this, it is useful to know which subject types of owl:sameAs to consider, since not all owl:sameAs assertions are related to DBpedia, and more specifically to musicians and bands. To do so, we can directly assign the voidext:linkPredicateDomain property to the link set (see line 7 in Fig. 8, right-hand side). Similarly, we can explicitly declare the object types of the link predicate by asserting the voidext:linkPredicateRange property (see line 6). Since it is a simple link set, the range must be defined as an OWL DL class expression based on the TBox of the dataset where the object is described (i.e. void:objectsTarget). In Fig. 8, the owl:sameAs range is defined with OWL 2 DL as follows: ex:Artist_Band ≡ dbo:Artist ⊔ dbo:Band ⊔ <http://dbpedia.org/class/yago/Artist109812338>. To explicitly define a simple virtual link set between two datasets that share the same instances of the same class, we can instantiate the class voidext:SharedInstanceSet (Def. 7). An example of a shared instance set between the Bgee [6] and OMA datasets is depicted in Listing 1.2.
Therefore, with VoIDext, we can model simple link sets in a less verbose and more explicit way. We also avoid the multiple representations allowed by VoID to define the link predicate's range/domain, as depicted in Subsection 3.1. For instance, VoID does not restrict the subject or object targets to a single class partition; thus, if multiple class partitions are defined, we are not able to distinguish which classes refer to the domain/range of the link predicate. Further examples of simple link sets involving the datasets in Tab. 1 and defined with VoIDext are available in [11].

VoIDext benefits and discussions
VoID instances (ABox) are fully backward compatible with the VoIDext schema since we add new terms without modifying the original VoID TBox. The only modification concerns the domain of the void:target property. In VoIDext, this domain is the union of the void:Linkset and voidext:SharedInstanceSet classes instead of solely void:Linkset, as stated in VoID. We did this to avoid replicating a similar property to state the target datasets of shared instance sets. Despite this modification, assertions of void:target based on VoID remain compatible with VoIDext.

Retrieving virtual link sets
Once the virtual links are modelled with VoIDext as discussed in Subsections 3.2 and 3.3, there are at most four kinds of virtual link sets: (i) a voidext:ComplexLinkSet composed of void:Linksets (e.g. see Fig. 5); (ii) a voidext:ComplexLinkSet composed of voidext:SharedInstanceSets (e.g. see Fig. 6); (iii) a void:Linkset that is also a voidext:SimpleLinkSet (e.g. see Fig. 8); and (iv) a voidext:SharedInstanceSet that is also a voidext:SimpleLinkSet (exemplified in Listing 1.2). For each kind of virtual link set, a SPARQL query template to retrieve the essential information is asserted as an annotation of the voidext:ComplexLinkSet and voidext:SimpleLinkSet subclasses of voidext:VirtualLinkSet. These annotations are made by asserting the voidext:queryLinkset and voidext:querySharedInstanceSet properties. Therefore, to retrieve virtual link sets of type (i) and (iii), we can execute the SPARQL queries assigned with voidext:queryLinkset to the voidext:ComplexLinkSet and voidext:SimpleLinkSet classes, respectively. Similarly, to retrieve virtual link sets of type (ii) and (iv), we can execute the SPARQL queries assigned with voidext:querySharedInstanceSet to the voidext:ComplexLinkSet and voidext:SimpleLinkSet classes, respectively. Due to page limits, these queries are described in [12].
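The four kinds and the annotation property that holds the matching query template can be summarised in a small dispatch routine. This Python sketch is illustrative only: the function name and its arguments are assumptions, and the query templates themselves (described in [12]) are not reproduced.

```python
def virtual_link_set_kind(types, intersect_member_types=frozenset()):
    """Classify a virtual link set as kind 'i', 'ii', 'iii' or 'iv'.

    `types` holds the rdf:type values of the set itself;
    `intersect_member_types` holds the classes of the two sets it
    references with voidext:intersectAt (complex link sets only).
    """
    if "voidext:ComplexLinkSet" in types:
        if intersect_member_types == {"void:Linkset"}:
            return "i"
        if intersect_member_types == {"voidext:SharedInstanceSet"}:
            return "ii"
        return None  # a mix of both member classes is excluded by Def. 6
    if "voidext:SimpleLinkSet" in types:
        if "void:Linkset" in types:
            return "iii"
        if "voidext:SharedInstanceSet" in types:
            return "iv"
    return None


# Annotation property carrying the query template for each kind.
QUERY_PROPERTY = {
    "i": "voidext:queryLinkset",
    "iii": "voidext:queryLinkset",
    "ii": "voidext:querySharedInstanceSet",
    "iv": "voidext:querySharedInstanceSet",
}

kind = virtual_link_set_kind({"void:Linkset", "voidext:SimpleLinkSet"})
print(kind, QUERY_PROPERTY[kind])
```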

Virtual link set maintenance
Although, to manage the virtual link set evolution is out of the scope of this article, we recommend to annotate the link sets with the issued and modified dates such as illustrated in Fig. 5. This date information helps with the maintenance of virtual link sets. For example, let us suppose the release of a new version of the DBpedia in August, 2019. By checking the difference between the DBpedia new release date and the complex link set issued/modified date (e.g. June 2019, see Fig. 5), it might indicate a possible decrease in the virtual link set performance, or even, invalidity of the interlinks due to the fact of being outdated. In addition, for each virtual link set, we can state the performance in terms of precision, recall, true positives, and so on by asserting the voidext:hasPerformanceMeasure property. The range of this property is mex-perf:PerformanceMeasure 13 . Thus, we can rely on the Mex-perf ontology 13 to describe the virtual link set performances. The complex link set example about Swiss cantons in Fig. 5 has a precision and recall of 100%. In this example, for every Swiss canton in LINDAS exists a corresponding one in DBpedia. Therefore, if this performance is deteriorated after the new release of one of the datasets involved, we should review this virtual link set. Let us consider the example depicted in Fig. 8. This link set that is a voidext:SimpleLinkSet between artists and bands in DBpedia and MusicBrainz. The link set contains fewer links than a complex link set composed of the following link predicates: foaf:name in MusicBrainz and rdfs:label in DBpedia. Indeed, there are 812 against 530 links in this complex link set. This complex link set is the ex:DBPEDIA MUSICBRAINZ VL instance and its serialization in RDF/-Turtle syntax is available in [11]. This set enables to establish links that were not possible with the simple link set because of missing owl:sameAs assertions in the MusicBrainz dataset. 
For example, this is the case of the "Izaline Calister" artist resource in the MusicBrainz RDF dataset, which does not assert the owl:sameAs property with the dbr:Izaline_Calister IRI from DBpedia. Moreover, we can also rely on the performance metrics to choose among different virtual link sets that serve the same purpose, such as linking artists between DBpedia and MusicBrainz. For example, if we want to write a federated query about artists that favours recall over precision, we should choose the ex:DBPEDIA_MUSICBRAINZ_VL based on names/labels rather than the owl:sameAs link set depicted in Fig. 8.
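The two maintenance checks described above, comparing dates and choosing a link set by its performance, can be sketched as follows. This is a minimal illustration under stated assumptions: the LinkSetInfo record and both helper functions are hypothetical, the day of the month is a placeholder because the text only gives months, and the precision and recall values of the two artist link sets are invented placeholders rather than measured results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LinkSetInfo:
    """Hypothetical in-memory view of a virtual link set's annotations."""
    name: str
    modified: date    # issued/modified date annotation
    precision: float  # performance values, each in [0, 1]
    recall: float


def may_be_outdated(link_set: LinkSetInfo, dataset_release: date) -> bool:
    """Flag a link set last modified before a dataset release for review."""
    return dataset_release > link_set.modified


def pick_by_recall(candidates: list[LinkSetInfo]) -> LinkSetInfo:
    """Choose the candidate with the highest recall."""
    return max(candidates, key=lambda ls: ls.recall)


# Swiss cantons example: modified in June 2019, precision and recall of 100%.
cantons = LinkSetInfo("cantons", date(2019, 6, 1), 1.0, 1.0)
# Supposed DBpedia release in August 2019 (day is a placeholder).
dbpedia_release = date(2019, 8, 1)

# Placeholder performance values, NOT taken from the article.
same_as_based = LinkSetInfo("same_as_based", date(2019, 6, 1), 1.0, 0.5)
name_based = LinkSetInfo("name_based", date(2019, 6, 1), 0.9, 0.8)

if __name__ == "__main__":
    print(may_be_outdated(cantons, dbpedia_release))
    print(pick_by_recall([same_as_based, name_based]).name)
```

A date gap only signals that a review is due; whether the links are still valid has to be established by recomputing the performance measures on the new release.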

Benefits and a SIB Swiss Institute of Bioinformatics application
Easing the task of writing SPARQL 1.1 federated queries. The formal description of virtual link sets among multiple RDF datasets on the Web facilitates the manual or (semi-)automatic writing of federated queries. This is because, once the virtual link sets between datasets are defined with VoIDext, we can interlink different RDF datasets without having to mine this information again from the various ABoxes and TBoxes (including documentation, if any). The mining task becomes increasingly complex and tedious if the TBox is missing or incomplete with respect to the ABox statements, for example when a triple predicate is not defined in the TBox.
Applying semantic relaxation rather than semantic reconciliation. The virtual link statements between datasets focus on the meaning of interlinking RDF graph nodes rather than on the semantics of each node in the different datasets and knowledge domains. For example, let us consider the virtual link illustrated in Fig. 1. When considering solely the LINDAS dataset, lindas:longName is an rdf:Property labelled as a "District name or official municipality name". In DBpedia, dbp:shortName is an rdf:Property labelled as "short name" that in principle can be applied to any instance. Hence, it is not restricted to district names. In addition, one property is about long names while the other one is about short names. However, they state similar literals in the context of Swiss cantons, as discussed in Section 3. Therefore, although these properties are semantically different (hard to reconcile), we can still ignore these heterogeneities for the sake of interlinking DBpedia and LINDAS.
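To illustrate how such a relaxed virtual link can drive the writing of a federated query, the sketch below builds a SPARQL 1.1 query with one SERVICE clause per dataset and joins the two sides on the string value of the linked literals. It is a sketch under assumptions, not the method of this article: the function federated_query is hypothetical, the LINDAS endpoint and the IRI used for lindas:longName are example.org placeholders, and comparing literals with STR is a simplification, since the text only states that the literals are similar.

```python
def federated_query(left_endpoint: str, left_predicate: str,
                    right_endpoint: str, right_predicate: str) -> str:
    """Build a federated query joining two datasets on literal values.

    Hypothetical helper: each SERVICE clause fetches the literals of one
    link predicate, and the FILTER keeps pairs whose string values match,
    ignoring language tags and datatypes.
    """
    return f"""SELECT ?left ?right WHERE {{
  SERVICE <{left_endpoint}> {{ ?left <{left_predicate}> ?leftValue . }}
  SERVICE <{right_endpoint}> {{ ?right <{right_predicate}> ?rightValue . }}
  FILTER (STR(?leftValue) = STR(?rightValue))
}}"""


# Placeholders (assumptions): the real LINDAS endpoint and the namespace
# behind the lindas: prefix must be taken from the dataset documentation.
LINDAS_ENDPOINT = "http://example.org/lindas/sparql"
LINDAS_LONG_NAME = "http://example.org/lindas/longName"

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
DBP_SHORT_NAME = "http://dbpedia.org/property/shortName"

if __name__ == "__main__":
    print(federated_query(LINDAS_ENDPOINT, LINDAS_LONG_NAME,
                          DBPEDIA_ENDPOINT, DBP_SHORT_NAME))
```

The point of the sketch is that the endpoints and link predicates are read from the virtual link set description instead of being rediscovered by inspecting each dataset. In practice, additional triple patterns restricting both sides to Swiss cantons would be needed for the join to be meaningful and efficient.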
Facilitating knowledge discovery. As noted in [2], many challenges remain to be addressed in the Semantic Web, such as the need for prior knowledge of the existing RDF datasets and of how to combine them to process a query. VoIDext mitigates these issues because RDF publishers (including third-party ones) are able to provide virtual link sets that explicitly describe how heterogeneous datasets of distinct domains are related. Without knowing these links, extracting new knowledge that combines these datasets is harder, or even impossible. The virtual link sets stated with VoIDext terms provide sufficient machine-readable information to relate the datasets. Nonetheless, the automatic generation of these link sets is out of the scope of this article.
A SIB Swiss Institute of Bioinformatics application. We applied the VoIDext vocabulary in the context of a real-case application mainly involving three in-production life-sciences datasets available on the Web, namely the UniProtKB, OMA and Bgee RDF stores (see SPARQL endpoints in Tab. 1). Furthermore, we also provide some virtual link sets that consider the DrugBank and EBI datasets. The RDF serialization of the virtual links among these three databases is available in [11], and it can be queried via the SPARQL endpoint in [10] with the query templates defined in [12], as described in Subsection 4.1. Based on these virtual links, a set of more than twelve specialized federated query templates over these data stores was defined 14. These templates are also available through a template-based search engine 15. Moreover, as an example of facilitating knowledge discovery, we can mention the virtual link sets between OMA and Bgee. When combined, these two distinct biological knowledge domains make it possible to predict gene expression conservation for orthologous genes (i.e. corresponding genes in different species). Finally, new virtual link sets are being created to support other biological databases in the context of SIB (https://sib.swiss).

Conclusion
We successfully extended the VoID vocabulary (i.e. VoIDext) to formally describe virtual links, and we provided a set of SPARQL query templates to retrieve them. To do so, we applied an agile methodology based on the SAMOD approach. We described the benefits of defining virtual links with the VoIDext RDF schema, notably to facilitate the writing of federated queries and knowledge discovery. In addition, virtual links enable interoperability among different knowledge domains without imposing any changes on the original RDF datasets. In the future, we intend to use VoIDext to enhance keyword-search engines over multiple distributed and independent RDF datasets. We also plan to propose tools to semi-automatically create VoIDext virtual link statements between RDF datasets. Finally, to support virtual link evolution, we aim to develop a tool that automatically detects virtual links broken by either data schema changes or radical modifications of instances' IRIs and property assertions.