In this section, we present the properties that a sub-graph of our schema must have in order to be considered a high-quality summary. Specifically, we are interested in important schema nodes that can efficiently describe the whole schema and, at the same time, reflect the distribution of the data instances. To capture these properties, we use the notions of relevance and coverage, which are analyzed further below. Relevance is used for identifying the most important nodes, and coverage is used for extracting nodes/paths that cover the whole spectrum of the RDF/S document.
3.1 Relevance
Importance has a broad range of meanings, and this has led to many different algorithms that try to identify it. Originating from the analysis of social graphs, in the domain of the Semantic Web, algorithms adapting the well-known PageRank [4, 5] have been proposed to determine the importance of elements in an XML document. For RDF/S, other approaches use measures such as the degree centrality, the betweenness and the eigenvector centrality (weighted PageRank and HITS) [7], adjusting them to the specific features of RDF/S, or they adapt the degree centrality and the closeness [8] to calculate the relevance of a node.
In our case, we believe that the importance of a node should describe how well a node could represent a part of a KB (its area) giving an intuition about its neighborhood. Intuitively, nodes with many connections in a schema graph will have a high importance. However, since RDF/S KBs might contain huge amounts of data, the latter data should also be involved when trying to estimate a node’s importance.
Consider for example the node “E37 Mark” and the node “E38 Image” in the schema graph of Fig. 1. The two nodes have the same number of connections and they are connected to the same node “E18 Physical Thing”. Now assume that the node “E38 Image” has twice as many instances as “E37 Mark”. Because they have the same number of connections, the two nodes might be considered equally important, but “E38 Image” is in fact more important for the specific RDF/S KB, due to the higher number of instances it contains. The number of instances of the class that a node corresponds to is therefore a valuable piece of information for identifying its importance.
In our approach, initially, we determine how central/important a node is, judging from the instances it contains (relative cardinality). After that, we estimate the centrality of a node in the entire KB (in/out centrality), combining the relative cardinality with the number and type of the incoming and outgoing edges in the schema. Finally, the relevance of a schema node is defined by comparing its centrality with the centrality of its neighbors.
Relative Cardinality. The cardinality of a schema node is the number of instances it contains in the current RDF/S KB. If there are many instances of a specific class, then that class is more likely to be more important than another with very few instances. Similarly, the cardinality of an edge between two nodes in a graph is the number of the corresponding instances of the nodes connected with that specific edge. Now we can formally define the relative cardinality of an edge.
Definition 3 (Relative Cardinality of an edge). Let $S = (V, E, \lambda_c, \lambda_p, H)$ be an RDF schema graph and $I = (N, R, \tau_v, \tau_c, \tau_p)$ the RDF instance graph of $S$. The relative cardinality of an edge $e(v_i, v_j)$ in $S$, where $e \in E$ and $v_i, v_j \in V$, i.e. $RC(e(v_i, v_j))$ (remember that $\lambda_p(e) = p$), is the following:
- In case of available instances: the number of specific instance connections $r(n_i, n_j) \in R$, with $n_i, n_j \in N$, where $\tau_p(r) = \lambda_p(e)$, $\tau_c(n_i) = \lambda_c(v_i)$ and $\tau_c(n_j) = \lambda_c(v_j)$, divided by the total number of the connections ($r_k(n_i, n_a), r_t(n_b, n_j) \in R$, where $n_a, n_b \in N$) of the instances of these two nodes $v_i$, $v_j$. A constant value $\alpha$ is added to this number.
- In case of no available instances: a constant value $\alpha$.
$$ RC(e(v_i, v_j)) = \begin{cases} \alpha + \dfrac{|\{r_m(n_i, n_j)\}|}{|\{r_k(n_i, n_a)\}| + |\{r_t(n_b, n_j)\}|}, & r_m(n_i, n_j) \in R \\ \alpha, & r_m(n_i, n_j) \notin R \end{cases} $$
(1)
The constant $\alpha$ has the value 1/#connections, where #connections is the number of connections $e(v_i, v_j)$ that exist in the schema. Our algorithm is flexible enough to focus on the available instances when they exist; when they are not available, it exploits only the semantics and the structure of the schema.
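The following is a minimal sketch of Eq. (1), not the authors' implementation. It assumes a hypothetical encoding in which every instance connection is a triple (class of the subject instance, property, class of the object instance), and it reads the denominator as the number of connections leaving instances of the source class plus the number of connections arriving at instances of the target class. The function name `relative_cardinality`, the property name "located in" and the toy data are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical encoding: (class of subject instance, property, class of object instance).
InstanceEdge = Tuple[str, str, str]
SchemaEdge = Tuple[str, str, str]


def relative_cardinality(edge: SchemaEdge,
                         instance_edges: List[InstanceEdge],
                         num_schema_edges: int) -> float:
    """Relative cardinality of one schema edge, following our reading of Eq. (1)."""
    src, _prop, dst = edge
    alpha = 1.0 / num_schema_edges  # alpha = 1 / #connections in the schema
    # Instance connections whose property and endpoint classes match the schema edge.
    matching = sum(1 for e in instance_edges if e == edge)
    if matching == 0:
        return alpha  # no instances for this edge: fall back to the constant
    # All connections leaving instances of the source class ...
    out_of_src = sum(1 for s, _, _ in instance_edges if s == src)
    # ... plus all connections arriving at instances of the target class.
    into_dst = sum(1 for _, _, o in instance_edges if o == dst)
    return alpha + matching / (out_of_src + into_dst)


# Toy data (assumed): three "consists of" links and one location link.
toy_instances = (
    [("E18 Physical Thing", "P45 consists of", "E57 Material")] * 3
    + [("E18 Physical Thing", "located in", "E53 Place")]
)
rc_material = relative_cardinality(
    ("E18 Physical Thing", "P45 consists of", "E57 Material"), toy_instances, 4)
rc_place = relative_cardinality(
    ("E18 Physical Thing", "located in", "E53 Place"), toy_instances, 4)
print(rc_material, rc_place)
```

On this toy data the edge toward “E57 Material” receives a larger value than the edge toward “E53 Place”, and a schema edge with no matching instances receives exactly $\alpha$.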
In/Out Centrality. In order to combine the notion of centrality in the schema with the distribution of the corresponding dataset, we define the in/out centrality, which also exploits the relative cardinality of the edges. The in/out centrality is an adaptation of the degree centrality [7]. In an undirected graph, the degree centrality is defined as the number of links incident upon a node. In a directed graph, however, as in our case, the degree centrality is split into the in-degree centrality and the out-degree centrality.
The in-centrality of a schema node $v$, i.e. $C_{in}(v)$, is the sum of the weighted relative cardinalities of the incoming edges. The weights are defined experimentally and depend on the types of the properties, as identified by the function $\kappa_P$. As already mentioned, there are two types of properties: the standard RDF/S properties (for example “rdfs:subClassOf”, “rdfs:label”, “rdfs:comment”) and the user-defined properties (for example “P45 consists of” and “P128 carries”, shown in Fig. 1). We consider the latter more important than the former. This is partly because the user-defined properties correlate classes, each exposing the connectivity of the entire schema, in contrast to the hierarchical RDF/S properties.
Definition 4 (in(out)-centrality of a node). Let $S$ be an RDF schema graph and $m$ be the number of the incoming (outgoing) edges $e(v_i, v)$ ($e(v, v_i)$) of a node $v$ in $S$. The $C_{in}(v)$ ($C_{out}(v)$) of $v$ is the sum of the relative cardinalities of the edges $e(v_i, v)$ ($e(v, v_i)$), each multiplied by a weight $w_p$ according to the type of the edge.
$$ C_{in}(v) = \sum_{i=1}^{m} RC(e(v_i, v)) * w_p \qquad C_{out}(v) = \sum_{i=1}^{m} RC(e(v, v_i)) * w_p $$
(2)
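As an illustration of Eq. (2), the sketch below sums weighted relative cardinalities over the incoming and outgoing edges of a node. The weights are defined experimentally in the text and are not given here, so the values `W_USER` and `W_STANDARD`, the prefix test in `weight_of`, and the toy relative cardinalities are all assumptions made for the example.

```python
from typing import Dict, Tuple

SchemaEdge = Tuple[str, str, str]  # (source class, property, target class)

# Assumed weights: user-defined properties count more than standard RDF/S ones.
W_USER = 1.0
W_STANDARD = 0.5


def weight_of(prop: str) -> float:
    """Hypothetical stand-in for the property-type function: prefix-based."""
    return W_STANDARD if prop.startswith(("rdf:", "rdfs:")) else W_USER


def in_out_centrality(node: str, rc: Dict[SchemaEdge, float]) -> Tuple[float, float]:
    """Return (C_in, C_out) of `node` as weighted sums of relative cardinalities."""
    c_in = sum(v * weight_of(p) for (_, p, dst), v in rc.items() if dst == node)
    c_out = sum(v * weight_of(p) for (src, p, _), v in rc.items() if src == node)
    return c_in, c_out


# Toy relative cardinalities (assumed values, not taken from Fig. 1).
toy_rc = {
    ("A", "p", "B"): 0.5,
    ("C", "rdfs:subClassOf", "B"): 0.4,
    ("B", "q", "D"): 0.3,
}
c_in_b, c_out_b = in_out_centrality("B", toy_rc)
print(c_in_b, c_out_b)
```

With these assumed weights, the “rdfs:subClassOf” edge contributes only half of its relative cardinality to the in-centrality of B.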
Relevance. The notion of centrality, as defined previously, is a measure that gives us an intuition about how central a schema node is in an RDF/S KB. However, its importance should be determined by also considering the centrality of the other nodes. Consider, for example, the nodes “E60 Number” and “E56 Language” shown in Fig. 1. They have the same number of incoming and outgoing edges; assume that they have the same number of instances as well. However, “E60 Number” is connected to more important elements than “E56 Language”. For example, the node “E18 Physical Thing” is directly connected to “E60 Number” and has many other connections and instances. Since “E18 Physical Thing” is a very important node, “E60 Number” is a less appropriate node to represent this area in a summary. On the other hand, “E56 Language” is more relevant than “E60 Number” for representing its part of the graph, since its neighbors do not have such a high relevance.
To achieve the aforementioned goal, the relevance of a node is affected by its surrounding neighbors and more specifically by the number and the connections of its adjacent nodes. To be more precise, the formula estimates the (number of) connections of a node and this number is compared to the connections of its neighbors.
Definition 5 (Relevance of a node). Let $S$ be an RDF schema graph, $np_{in}$ be the number of incoming nodes $v_i$ connected to $v$ with $e_a(v_i, v)$, and $np_{out}$ be the number of outgoing nodes $v_j$ connected to $v$ with $e_b(v, v_j)$. The relevance of $v$, i.e. $Relevance(v)$, is the sum of the in- and out-centrality of $v$, each multiplied by the corresponding number of nodes, divided by the sum of the out-centrality of the incoming nodes $v_i$ and the in-centrality of the outgoing nodes $v_j$.
$$ Relevance(v) = \frac{C_{in}(v) * np_{in} + C_{out}(v) * np_{out}}{\sum\limits_{i=1}^{np_{in}} C_{out}(v_i) + \sum\limits_{j=1}^{np_{out}} C_{in}(v_j)} $$
(3)
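Eq. (3) can be sketched as follows. The input is assumed to be a list of schema edges that already carry their weighted relative cardinality (the product of the relative cardinality and the property weight). Returning 0.0 for a node without neighbors is an assumption of this sketch, since the definition does not cover that case; the function name and toy values are illustrative.

```python
from typing import List, Tuple

WeightedEdge = Tuple[str, str, float]  # (source, target, RC * weight)


def relevance(node: str, edges: List[WeightedEdge]) -> float:
    """Relevance of `node` following Eq. (3), over pre-weighted edges."""
    def c_in(v: str) -> float:
        return sum(w for _, dst, w in edges if dst == v)

    def c_out(v: str) -> float:
        return sum(w for src, _, w in edges if src == v)

    incoming = {src for src, dst, _ in edges if dst == node}  # the nodes v_i
    outgoing = {dst for src, dst, _ in edges if src == node}  # the nodes v_j
    numerator = c_in(node) * len(incoming) + c_out(node) * len(outgoing)
    denominator = sum(c_out(v) for v in incoming) + sum(c_in(v) for v in outgoing)
    # Assumption: an isolated node (empty denominator) gets relevance 0.
    return numerator / denominator if denominator else 0.0


# Toy graph (assumed values): A and C point to B, B points to D, A also points to E.
toy_edges = [("A", "B", 0.5), ("C", "B", 0.2), ("B", "D", 0.3), ("A", "E", 0.4)]
rel_b = relevance("B", toy_edges)
print(rel_b)
```

Here node B has two incoming neighbors and one outgoing neighbor, so its numerator is 0.7 * 2 + 0.3 * 1 = 1.7, while its neighbors contribute 0.9 + 0.2 + 0.3 = 1.4 to the denominator.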
Obviously, the relevance of a schema node in an RDF/S KB is determined by both its connectivity in the schema and the cardinality of the instances. Thus, the number of instances of a node is of vital importance in the assessment procedure. When the data distribution significantly changes, the focus of the entire data source is shifted as well, and as a result, the relevance of the nodes changes. In addition, the importance of each node is compared to the other nodes in the specific area/neighborhood in order to identify the most relevant nodes that can represent all the concepts of a graph.
3.2 Coverage
Having estimated the relevance of each node in the schema graph, we now focus on the paths that exist in the schema graph. The idea is that we are not interested in extracting isolated nodes; rather, we want to produce valid sub-schema graphs. The paths should therefore be selected so that they collect the most relevant nodes while minimizing overlaps.
Definition 6 (Path $v_s \to v_i$). A path from $v_s$ to $v_i$, i.e. $v_s \to v_i$, is the finite sequence of edges, which connect a sequence of nodes, starting from the node $v_s$ and ending in the node $v_i$.
As a consequence, the relative cardinality of a path is the sum of the relative cardinalities of its individual edges. Moreover, the length of a path, i.e. $d_{v_s \to v_i}$, is the number of edges in that path.
In our running example of Fig. 1, the nodes “E53 Place” and “E57 Material” are directly connected to the node “E18 Physical Thing” and have similar connectivity in the graph. The node “E18 Physical Thing” has a high relevance in the graph and, as a consequence, a high probability of being included in the summary. However, although an “E18 Physical Thing” can be located in only one “E53 Place”, it might consist of many “E57 Material”. As a consequence, the relative cardinality of the path from “E18 Physical Thing” to “E57 Material” (RC(e(“E18 Physical Thing”, “E57 Material”))) will be higher than the relative cardinality of the path from “E18 Physical Thing” to “E53 Place”. This means that the path from “E18 Physical Thing” to “E57 Material” is a better candidate for inclusion in the summary than the path from “E18 Physical Thing” to “E53 Place”. This is because “E18 Physical Thing” already covers “E53 Place”: a physical thing is located in only one place.
In the above example, we dealt with paths of length one. However, the paths included in the summary should contain the most relevant schema nodes which represent the remaining nodes, achieving the digest of the entire content of the RDF/S KB. As a consequence, the main criteria to estimate the level of coverage of a specific path are: (a) the relevance of each node contained in the path, (b) its relevant instances in the dataset and (c) the length of the path. As a result, similar to the approach of Yu et al. [5], we define the notion of coverage as follows:
Definition 7 (Coverage of a path). Let $S$ be an RDF schema graph and $I$ be an instance of $S$. The coverage of a path $v_s \to v_i$, i.e. $Coverage(v_s \to v_i)$, is derived by the sum of the Relevance of the sequential nodes $v_j$ contained between the nodes $v_s$ and $v_i$, multiplied by the relative cardinality of each edge $e(v_{j-1}, v_j)$ contained in the path. The result is divided by the length of the path in order to penalize the longer paths.
$$ Coverage(v_s \to v_i) = \frac{1}{d_{v_s \to v_i}} * \sum\limits_{j=2}^{d_{v_s \to v_i}} \left( Relevance(v_j) * RC\left( e(v_{j-1}, v_j) \right) \right) $$
(4)
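The sketch below illustrates Eq. (4) under one explicit assumption about indexing: the path is given as its node sequence starting at $v_s$, and the sum runs over every edge of the path, pairing each edge with the relevance of the node it arrives at. This follows the wording of Definition 7 ("each edge contained in the path"); the printed summation bounds could also be read more narrowly. The function name and the toy relevance and relative-cardinality values are illustrative.

```python
from typing import Dict, List, Tuple


def coverage(path: List[str],
             relevance: Dict[str, float],
             rc: Dict[Tuple[str, str], float]) -> float:
    """Coverage of a path given as a node sequence [v_s, ..., v_i]."""
    length = len(path) - 1  # d: number of edges in the path
    if length < 1:
        raise ValueError("a path needs at least one edge")
    # Assumption: sum over all edges (v_{j-1}, v_j), weighted by Relevance(v_j).
    total = sum(
        relevance[path[j]] * rc[(path[j - 1], path[j])]
        for j in range(1, len(path))
    )
    return total / length  # dividing by d penalizes longer paths


# Toy values (assumed): relevance per node and RC per edge.
toy_relevance = {"A": 1.0, "B": 2.0, "C": 0.5}
toy_rc = {("A", "B"): 0.4, ("B", "C"): 0.6}
cov_abc = coverage(["A", "B", "C"], toy_relevance, toy_rc)
cov_ab = coverage(["A", "B"], toy_relevance, toy_rc)
print(cov_abc, cov_ab)
```

In this toy case the shorter path A to B scores higher than A to B to C, because the extra hop adds a low-relevance node while doubling the divisor.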
The above formula assesses a path and provides a metric for how many relevant nodes it contains and how well the path can represent (a part of) the original graph without overlaps. Our goal is to select the schema nodes that are more relevant while avoiding nodes (or paths) in the summary that cover one another. The higher the coverage of a path, the more relevant this path is considered in representing the original graph or part of it.