Access Control Enforcement for Selective Disclosure of Linked Data
Abstract
Semantic Web technologies enable Web-scale data linking between large RDF repositories. However, organizations often cannot publish their whole datasets, but only subsets of them, due to ethical, legal or confidentiality considerations. Different user profiles may have access to different authorized subsets. In this case, selective disclosure appears as a promising incentive for linked data. In this paper, we show that modular, fine-grained and efficient selective disclosure can be achieved on top of existing RDF stores. We use a data-annotation approach to enforce access control policies. Our results are grounded on the formal results established in [14]. We present an implementation of our ideas, and we show that our solution for selective disclosure scales, is independent of the user query language, and incurs reasonable overhead at runtime.
Keywords
RDF · Authorization · Enforcement · Linked Data

1 Introduction
The Linked Data movement [5] (aka Web of Data) is about using the Web to create typed links between data from different sources. Technically, Linked Data refers to a set of best practices for publishing and connecting structured data on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and can in turn be linked to from external data sets [4]. Linking data distributed across the Web requires a standard mechanism for specifying the existence and meaning of connections between items described in this data. This mechanism is provided by the Resource Description Framework (RDF). Multiple datastores that belong to different thematic domains (government, publications, life sciences, etc.) publish their RDF data on the web^{1}. The size of the Web of Data is estimated at about 85 billion RDF triples (statements) from more than 3400 open data sets^{2}. One of the challenges of Linked Data is to encourage businesses and organizations worldwide to publish their RDF data into the linked data global space. Indeed, the published data may be sensitive, and consequently, data providers may refrain from releasing their sensitive information unless they are certain that the desired access rights of different accessing entities are enforced properly. Hence the issue of securing RDF content and ensuring the selective exposure of information to different classes of users is becoming all the more important. Several works have been proposed for controlling access to RDF data [1, 6, 7, 9, 10, 11, 13]. In [14], we proposed a fine-grained access control model with a declarative language for defining authorization policies (we call this model AC4RDF in the rest of this paper).
Our enforcement framework allows the definition of multi-subject policies with a global set of authorizations \(\mathscr {A} \). A subset \(\mathscr {A} _s \subseteq \mathscr {A} \) of authorizations is associated with each subject s who executes a (SPARQL) query. The subject’s policy is then enforced by AC4RDF, which computes the subgraph corresponding to the triples accessible by the authenticated subject. We use an annotation-based approach to enforce multi-subject policies: the idea is to materialize every triple’s applicable authorizations of the global policy into a bitset which is used to annotate the triple. The base graph G is transformed into a graph \(G^\mathscr {A} \) by annotating every triple \(t\in G\) with a bitset representing its set of applicable authorizations \(\mathsf {ar}(G, \mathscr {A})(t) \subseteq \mathscr {A} \). The subjects are similarly assigned a bitset which represents the set of authorizations assigned to them. When a subject sends a query, the system evaluates it over his/her positive subgraph. In Sect. 3 we give an overview of the RDF data model and the SPARQL query language. In Sect. 4 we give the semantics of the AC4RDF model, which are defined using the positive subgraph of the base graph. In Sect. 5 we propose an enforcement approach for the AC4RDF model in a multiple-subject context. We present and prove the correctness of our encoding approach. In Sect. 6 we give details about the implementation and experimental results.
2 Related Work
The enforcement techniques can be categorized into three approaches: pre-processing, post-processing and annotation based.
The pre-processing approaches enforce the policy before evaluating the query. For instance, the query rewriting technique consists of reformulating the user query using the access control policy. The new reformulated query is then evaluated over the original data source returning the accessible data only. This technique was used by Costabello et al. [6] and Abel et al. [1].
In the post-processing approaches, the query is evaluated over the original data source. The result of the query is then filtered using the access control policy to return the accessible data. Reddivari et al. [13] use a post-processing approach to enforce their models.
In the annotation based approaches, every triple is annotated with the access control information. During query evaluation, only the triples annotated with a permit access are returned to the user. This technique is used by Papakonstantinou et al. [11], Jain et al. [9], Lopes et al. [10] and Flouris et al. [7].
The advantage of pre-processing approaches such as query rewriting is that the policy enforcer is independent of the RDF data: updates on the data do not affect policy enforcement. On the other hand, this technique fully depends on the query language. Moreover, the query evaluation time may depend on the policy. The experiments in [1] showed that the query evaluation overhead grows with the number of authorizations, in contrast to our solution which does not depend on the number of authorizations. In post-processing approaches, the query response time may be considerably longer since policies are enforced after all data (allowed and not allowed) have been processed. The data-annotation approach enables fast query answering, since the triples are already annotated with the access information and only the triples with a grant access can be used to answer the query. On the other hand, any update of the data requires the re-computation of annotations.
Some works [11] support incremental re-computation of the annotated triples after data updates. In this paper, we do not handle data updates and we leave the incremental re-computation as future work.
In the data-annotation based approaches that hard-code the conflict resolution strategy [7], annotations are fully dependent on the used strategy so they need to be recomputed in case of change of the strategy. Our encoding is independent of the conflict resolution strategy function which is evaluated at query time, which means that changing the strategy does not impact the annotations.
As the semantics of an RDF graph are given by its closure, it is important for an access control model to take into account the implicit knowledge held by this graph. In the Semantic Web context, the policy authorizations deny or allow access to triples whether they are implicit or not. In [13], the implicit triples are checked at query time: inference is computed during every query evaluation, and if one of the triples in the query result could be inferred from a denied triple, then it is not added to the result. Hence query evaluation may be costly since the reasoner must be invoked for every query to compute inferences. To protect implicit triples, [9, 10] and [11] proposed a propagation technique where the implicit triples are automatically labeled on the basis of the labels assigned to the triples used for inference. Hence if one of the triples used for inference is denied, then the inferred triple is also denied. This introduces a new form of inference anomaly: if a triple is explicit (stored) then it is allowed; however, if the same triple is inferred, then it is denied. We illustrate this with the following example.
Example 1
Let us consider the graph \(G_0\) of Fig. 1. Suppose we want to protect \(G_0\) by applying the policy \(P=\){deny access to triples with type : Cancerous, allow access to all resources which are instances of : Patient}. The triple \(t_{9}\) is inferred from \(t_{2}\) and \(t_{7}\) using the RDFS subClassOf inheritance rule. With the propagation approaches which consider inference [9, 10, 11], the triple \(t_{9} = ( \!:\!alice \,; rdf\!:\!type \,; \!:\!Patient )\) will be denied since it is inferred from a denied triple (\(t_{7}\)). Hence the fact that alice is a patient will not be returned in the result even though the policy clearly allows access to it. Moreover, such a triple could also have been part of the explicit triples, and this could change its accessibility to the subject even though the graph semantics do not change.
In our model, explicit and implicit triples are handled homogeneously to avoid this kind of inference anomalies.
3 RDF Data Model
“Graph database models can be defined as those in which data structures for the schema and instances are modeled as graphs or generalizations of them, and data manipulation is expressed by graph-oriented operations and type constructors” [2]. The graph data model used in the semantic web is RDF (Resource Description Framework) [8]. RDF allows decomposition of knowledge in small portions called triples. A triple has the form “\(( subject \,; predicate \,; object )\)” built from pairwise disjoint countably infinite sets \(\mathsf {I}\), \(\mathsf {B}\), and \(\mathsf {L}\) for IRIs (Internationalized Resource Identifiers), blank nodes, and literals respectively. The subject represents the resource for which information is stored and is identified by an IRI. The predicate is a property or a characteristic of the subject and is identified by an IRI. The object is the value of the property and is represented by an IRI of another resource or a literal. In RDF, a resource which does not have an IRI can be identified using a blank node. Blank nodes are used to represent these unknown resources, and also used when the relationship between a subject node and an object node is n-ary (as is the case with collections). For ease of notation, in RDF, one may define a prefix to represent a namespace, such as \(\mathtt {rdf:type}\) where \(\mathtt {rdf}\) represents the namespace http://www.w3.org/1999/02/22-rdf-syntax-ns.
Note 1
In this paper, we explicitly write \(\mathtt {rdf}\) and \(\mathtt {rdfs}\) when the term is from the RDF or the RDFS standard vocabulary. However, we do not prefix the other terms for the sake of simplicity.
For instance the triple \(( \!:\!alice \,; \!:\!hasTumor \,; \!:\!breastTumor )\) states that alice has a breast tumor. A collection of RDF triples is called an RDF Graph and can be intuitively understood as a directed labeled graph where resources represent the nodes and the predicates the arcs as shown by the example RDF graph \(G_0\) in Fig. 1.
Definition 1
(RDF graph). An RDF graph (or simply “graph”, where unambiguous) is a finite set of RDF triples.
Example 2
Figure 1 depicts a graph \(G_0\) constituted by triples \(t_{1}\) to \(t_{9}\), both pictorially and textually.
We reuse the formal definitions and notation used by Pérez et al. [12]. Throughout this paper, \(\mathscr {P}(\mathsf {E})\) denotes the finite powerset of a set \(\mathsf {E}\) and \(\mathsf {F} \subseteq \mathsf {E}\) denotes a finite subset \(\mathsf {F}\) of a set \(\mathsf {E}\).
3.1 SPARQL
An RDF query language is a formal language used for querying RDF triples from an RDF store also called triple store. An RDF store is a database specially designed for storing and retrieving RDF triples. SPARQL (SPARQL Protocol and RDF Query Language) is a W3C recommendation which has established itself as the de facto language for querying RDF data. SPARQL borrowed part of its syntax from the popular and widely adopted SQL (Structured Query Language). The main mechanism for computing query results in SPARQL is subgraph matching: RDF triples in both the queried RDF data and the query patterns are interpreted as nodes and edges of directed graphs, and the resulting query graph is matched to the data graph using variables.
Definition 2
(Triple Pattern, Graph Pattern). A term \(t\) is either an IRI, a variable or a literal. Formally \( t \in \mathsf {T} =\mathsf {I} \cup \mathsf {V} \cup \mathsf {L} \). A tuple \(t \in \mathsf {TP} = \mathsf {T} \times \mathsf {T} \times \mathsf {T} \) is called a Triple Pattern (TP). A Basic Graph Pattern (BGP), or simply a graph, is a finite set of triple patterns. Formally, the set of all BGPs is \(\mathsf {BGP} = \mathscr {P}(\mathsf {TP})\).
Given a triple pattern \(tp\in \mathsf {TP} \), \({{\mathrm{\mathsf {var}}}}(tp)\) is the set of variables occurring in tp. Similarly, given a basic graph pattern \(B\in \mathsf {BGP} \), \({{\mathrm{\mathsf {var}}}}(B)\) is the set of variables occurring in the BGP, defined by \({{\mathrm{\mathsf {var}}}}(B) = \{ v \mid \exists tp \in B.\, v \in {{\mathrm{\mathsf {var}}}}(tp) \}\).
In this paper, we do not make any formal difference between a basic graph pattern and a graph. When graph patterns are considered as instances stored in an RDF store, we simply call them graphs.
The evaluation of a graph pattern \(B\) on another graph pattern G is given by mapping the variables of \(B\) to the terms of G such that the structure of \(B\) is preserved. First, we define the substitution mappings as usual. Then, we define the evaluation of \(B\) on G as the set of substitutions that embed \(B\) into G.
Definition 3
(Substitution Mappings). A substitution (mapping)\(\eta \) is a partial function \(\eta : \mathsf {V} \rightarrow \mathsf {T} \). The domain of \(\eta \), \({{\mathrm{\mathsf {dom}}}}(\eta )\), is the subset of \(\mathsf {V} \) where \(\eta \) is defined. We overload notation and also write \(\eta \) for the partial function \(\eta ^\star :\mathsf {T} \rightarrow \mathsf {T} \) that extends \(\eta \) with the identity on terms. Given two substitutions \(\eta _1\) and \(\eta _2\), we write \(\eta = \eta _1\eta _2\) for the substitution \(\eta : ?v \mapsto \eta _2(\eta _1( ?v ))\) when defined.
Given a triple pattern \(tp=( s \,; p \,; o )\in \mathsf {TP} \) and a substitution \(\eta \) such that \({{\mathrm{\mathsf {var}}}}(tp) \subseteq {{\mathrm{\mathsf {dom}}}}(\eta )\), \((tp)\eta \) is defined as \(( \eta (s) \,; \eta (p) \,; \eta (o) )\). Similarly, given a graph pattern \(B\in \mathsf {BGP} \) and a substitution \(\eta \) such that \({{\mathrm{\mathsf {var}}}}(B) \subseteq {{\mathrm{\mathsf {dom}}}}(\eta )\), we extend \(\eta \) to graph pattern by defining \((B)\eta = \{ (tp)\eta \mid tp \in B\}\).
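The substitution machinery above can be sketched in a few lines of Python. The encoding is an illustrative assumption, not the AC4RDF implementation: terms are strings, variables start with `?`, triple patterns are 3-tuples and BGPs are sets of them.

```python
def is_var(term):
    """A term is a variable iff it starts with '?' (assumed encoding)."""
    return isinstance(term, str) and term.startswith("?")

def apply_subst(eta, tp):
    """(tp)eta: replace each variable of tp that eta defines; eta acts as
    the identity on IRIs, literals and variables outside dom(eta)."""
    return tuple(eta.get(t, t) if is_var(t) else t for t in tp)

def apply_subst_bgp(eta, bgp):
    """(B)eta = {(tp)eta | tp in B}."""
    return {apply_subst(eta, tp) for tp in bgp}
```

For instance, applying `{"?d": ":bob", "?s": ":onc"}` to `("?d", ":service", "?s")` yields `(":bob", ":service", ":onc")`, while variables outside the domain are left untouched.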
Definition 4
(BGP Evaluation). Given a basic graph pattern \(B\in \mathsf {BGP} \) and a graph G, the evaluation of \(B\) over G is defined by \(\llbracket {B} \rrbracket _{G} = \{ \eta \mid {{\mathrm{\mathsf {dom}}}}(\eta ) = {{\mathrm{\mathsf {var}}}}(B) \text { and } (B)\eta \subseteq G \}\).
Example 3
Let \(B\) be defined as \(B= \{( ?d \,; \!:\!service \,; ?s ),\)\(( ?d \,; \!:\!treats \,; ?p )\}\). \(B\) returns the doctors, their services and the patients they treat. The evaluation of \(B\) on the example graph \(G_0\) of Fig. 1 is \(\llbracket {B} \rrbracket _{G_0} = \{\eta \}\), where \(\eta \) is defined as \(\eta : ?d \mapsto \, \!:\!bob \), \( ?s \mapsto \, \!:\!onc \) and \( ?p \mapsto \, \!:\!alice \).
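The evaluation \(\llbracket {B} \rrbracket _{G_0}\) of Example 3 can be reproduced with a small backtracking matcher. The encoding (strings for terms, 3-tuples for triples) and the two-triple fragment of \(G_0\) below are illustrative assumptions.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def match(tp, triple, eta):
    """Try to extend substitution eta so that (tp)eta == triple."""
    eta = dict(eta)
    for pat, val in zip(tp, triple):
        if is_var(pat):
            if eta.setdefault(pat, val) != val:
                return None          # conflicting binding for the same variable
        elif pat != val:
            return None              # constant mismatch
    return eta

def evaluate(bgp, graph):
    """[[B]]_G: every substitution that embeds all patterns of B into G."""
    solutions = [{}]
    for tp in bgp:
        solutions = [e for eta in solutions for t in graph
                     if (e := match(tp, t, eta)) is not None]
    return solutions

# A fragment of G_0 (triple contents assumed from Example 3):
g0 = {(":bob", ":service", ":onc"), (":bob", ":treats", ":alice")}
bgp = [("?d", ":service", "?s"), ("?d", ":treats", "?p")]
```

Here `evaluate(bgp, g0)` returns the single substitution of Example 3, binding `?d` to `:bob`, `?s` to `:onc` and `?p` to `:alice`.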
Formally, the definition of BGP evaluation captures the semantics of SPARQL restricted to the conjunctive fragment of SELECT queries that do not use FILTER, OPT and UNION operators (see [12] for further details).
Another key concept of the Semantic Web is named graphs, in which a set of RDF triples is identified using an IRI, forming quads. This allows the representation of meta-information about RDF data such as provenance information and context. In order to handle named graphs, SPARQL defines the concept of dataset. A dataset is composed of a distinguished graph, called the default graph, and of pairs each comprising an IRI and an RDF graph, called named graphs.
Definition 5
(Dataset). A dataset \(D = \{G_0, \langle u_1, G_1 \rangle , \ldots , \langle u_n, G_n \rangle \}\) is a set of graphs where \(G_0\) is the default graph and each pair \(\langle u_i, G_i \rangle \) with \(u_i \in \mathsf {I} \) is a named graph.
4 Access Control Semantics
AC4RDF semantics is defined using authorization policies. An authorization policy P is defined as a pair \(P=(\mathscr {A}, \mathsf {ch})\) where \(\mathscr {A} \) is a set of authorizations and \(\mathsf {ch}: \mathscr {P}(\mathscr {A}) \rightarrow \mathscr {A} \) is a so-called (abstract) conflict resolution function that picks out a unique authorization when several ones are applicable. The semantics of the access control model are given by means of the positive (authorized) subgraph \(G^+\) obtained by evaluating P on a base RDF graph G.
4.1 Authorization Semantics
Authorizations are defined using basic SPARQL constructions, namely basic graph patterns, in order to facilitate the administration of access control and to integrate authorizations homogeneously into concrete RDF stores without additional query mechanisms. In the following definition, effect \(+\) (resp. −) stands for access to be granted (resp. denied).
Definition 6
(Authorization). Let \(\mathsf {Eff} = \{{{\mathrm{+}}}, -\}\) be the set of applicable effects. Formally, an authorization \(a = (e, h, b)\) is an element of \(\mathsf {Auth} = \mathsf {Eff} \times \mathsf {TP} \times \mathsf {BGP} \). The component e is called the effect of the authorization \(a\); h and b are called its head and body respectively. The function \({{\mathrm{\mathsf {effect}}}}: \mathsf {Auth} {{\mathrm{\rightarrow }}}\mathsf {Eff} \) (resp., \({{\mathrm{\mathsf {head}}}}: \mathsf {Auth} {{\mathrm{\rightarrow }}}\mathsf {TP} \), \({{\mathrm{\mathsf {body}}}}: \mathsf {Auth} {{\mathrm{\rightarrow }}}\mathsf {BGP} \)) is used to denote the first (resp., second, third) projection function. The set \(\{h\} \cup b\) is called the underlying graph pattern of the authorization \(a\).
The concrete syntax “\(\texttt {GRANT}/\texttt {DENY} \; h \; \texttt {WHERE} \; b \)” is used to represent an authorization \(a = (e, h, b)\). The \(\texttt {GRANT}\) keyword is used when \(e={{\mathrm{+}}}\) and the \(\texttt {DENY}\) keyword when \(e=-\). The condition \(\texttt {WHERE} \; \emptyset \) is elided when b is empty.
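As a sketch, an authorization of Definition 6 can be represented by a small data structure; the class and field names below are ours, not part of AC4RDF, and triple patterns reuse the tuple encoding assumed earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Authorization:
    effect: str                      # '+' for GRANT, '-' for DENY
    head: tuple                      # triple pattern h
    body: frozenset = frozenset()    # basic graph pattern b (empty when elided)

    def graph_pattern(self):
        """The underlying graph pattern {h} ∪ b."""
        return {self.head} | set(self.body)

# "GRANT (?p ; :hasTumor ; ?t)" with an elided WHERE clause:
a = Authorization("+", ("?p", ":hasTumor", "?t"))
```

The frozen dataclass makes authorizations hashable, so they can later be collected into the sets \(\mathsf {ar}(G, \mathscr {A})(t)\) without further plumbing.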
Example 4
Consider the set of authorizations shown in Table 1. Authorization \(a_1\) grants access to triples with predicate : hasTumor. Authorization \(a_2\) states that all triples of type : Cancerous are denied. Authorizations \(a_3\) and \(a_4\) state that triples with predicate : service and : treats respectively are permitted. Authorization \(a_5\) states that triples about admission to the oncology service are specifically denied, whereas authorization \(a_6\) states that such information is allowed in the general case. \(a_7\) grants access to predicates’ domains and \(a_8\) denies access to any triple whose object is : Cancerous. Finally, authorization \(a_9\) denies access to any triple; it is meant to be a default authorization.
Example of authorizations
Given an authorization \(a\) and a graph G, we say that \(a\) is applicable to a triple \(t \in G\) if there exists a substitution \(\theta \) such that the head of \(a\) is mapped to t and all the conditions expressed in the body of \(a\) are satisfied as well. In other words, we evaluate the underlying graph pattern of \(a\) against G and we apply all the answers to the head of \(a\) in order to know which \(t \in G\) the authorization \(a\) applies to. In the concrete system we implemented, this evaluation step is computed using the mechanisms used to evaluate SPARQL queries. In fact, a given authorization \(a\) is translated to a SPARQL CONSTRUCT query which is evaluated over G. The result represents the triples to which \(a\) is applicable.
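A minimal sketch of applicability, assuming the same tuple encoding of triples as before and representing each authorization as a (name, effect, head, body) tuple. In the real system this step is a SPARQL CONSTRUCT query; here we emulate it with an in-memory matcher.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def match(tp, triple, eta):
    """Try to extend substitution eta so that (tp)eta == triple."""
    eta = dict(eta)
    for pat, val in zip(tp, triple):
        if is_var(pat):
            if eta.setdefault(pat, val) != val:
                return None
        elif pat != val:
            return None
    return eta

def embeds(bgp, graph, eta):
    """True iff eta extends to a substitution embedding every pattern of bgp in graph."""
    solutions = [eta]
    for tp in bgp:
        solutions = [e for s in solutions for t in graph
                     if (e := match(tp, t, s)) is not None]
    return bool(solutions)

def ar(graph, auths):
    """ar(G, A): maps each triple to the names of its applicable authorizations."""
    return {t: {name for (name, _eff, head, body) in auths
                if (eta := match(head, t, {})) is not None
                and embeds(body, graph, eta)}
            for t in graph}

# Authorizations a5, a6, a9 of Table 1 (heads assumed from their descriptions):
auths = [("a5", "-", ("?p", ":admitted", ":onc"), []),
         ("a6", "+", ("?p", ":admitted", "?s"), []),
         ("a9", "-", ("?s", "?p", "?o"), [])]
g = {(":alice", ":admitted", ":onc")}
```

On this one-triple graph, all three authorizations apply to the admission triple, matching the \(\mathsf {ar}\) value used for \(t_8\) in Example 5.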
Definition 7
(Applicable Authorizations). Given a graph G and a set of authorizations \(\mathscr {A} \), the function \(\mathsf {ar}(G, \mathscr {A})\) maps every triple \(t\in G\) to its set of applicable authorizations \(\mathsf {ar}(G, \mathscr {A})(t) = \{ a \in \mathscr {A} \mid a \text { is applicable to } t \}\).
Example 5
Consider the graph \(G_0\) shown in Fig. 1 and the set of authorizations \(\mathscr {A} \) shown in Table 1. The applicable authorizations on triple \(t_{8}\) are computed as \(\mathsf {ar}(G_0, \mathscr {A})(t_{8}) = \{a_5, a_6, a_9\}\).
The set of triples in a given graph G to which an authorization \(a\) is applicable is called the scope of \(a\) in G.
Definition 8
(Authorization Scope). Given a graph G, the scope of an authorization \(a\) is the set of triples to which it is applicable: \(\mathsf {scope}(G)(a) = \{ t \in G \mid a \in \mathsf {ar}(G, \{a\})(t) \}\).
Example 6
Consider authorization \(a_1\) in Table 1, and the graph \(G_0\) in Fig. 1. The scope of \(a_1\) is computed as follows: \(\mathsf {scope}(G_0)(a_1) = \{t_{4}\}\), the only triple of \(G_0\) whose predicate is : hasTumor.
4.2 Policy and Conflict Resolution Function
In the context of access control with both positive (grant) and negative (deny) authorizations, policies must deal with two issues: inconsistency and incompleteness. Inconsistency occurs when multiple authorizations with different effects are applicable to the same triple. Incompleteness occurs when triples have no applicable authorizations. Inconsistency is resolved using a conflict resolution strategy which selects one authorization when more than one applies. Incompleteness is resolved using a default strategy, which is an effect applied to the triples with no applicable authorizations. In [14], we abstracted from the details of the concrete resolution strategies by assuming that there exists a choice function that, given a finite set of possibly conflicting authorizations, picks out a unique one.
Definition 9
(Authorization Policy). An (authorization) policy P is a pair \(P = (\mathscr {A}, \mathsf {ch})\), where \(\mathscr {A} \) is a finite set of authorizations and \(\mathsf {ch}: \mathscr {P}(\mathscr {A}) \rightarrow \mathscr {A} \) is a conflict resolution function.
Example 7
An example policy is \(P=(\mathscr {A},\mathsf {ch})\) where \(\mathscr {A} \) is the set of authorizations in Table 1 and \(\mathsf {ch} \) is defined as follows. For every non-empty subset \(\mathscr {B} \) of \(\mathscr {A} \), \(\mathsf {ch} (\mathscr {B})\) is the first authorization (in the syntactical order of Table 1) of \(\mathscr {A} \) that appears in \(\mathscr {B} \). For \(\mathscr {B} =\emptyset \), \(\mathsf {ch} (\emptyset )\) is the default authorization \(a_9\).
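The conflict resolution function of Example 7 can be sketched as follows. Naming the authorizations a1..a9 after their order in Table 1, and returning the default authorization for the empty set, are our assumptions.

```python
def make_ch(ordered_auths, default):
    """Build the 'first in syntactical order' conflict resolution function;
    ch(∅) falls back to the given default authorization (an assumption)."""
    rank = {a: i for i, a in enumerate(ordered_auths)}
    def ch(applicable):
        if not applicable:
            return default
        return min(applicable, key=rank.__getitem__)
    return ch

order = [f"a{i}" for i in range(1, 10)]   # a1 .. a9, in Table 1 order
ch = make_ch(order, default="a9")
```

With this definition, `ch({"a5", "a6", "a9"})` yields `a5`, which is exactly the choice made for \(t_8\) in Example 8.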
The semantics of policies are given by composing the functions \(\mathsf {ar}\), \(\mathsf {ch} \) and then \({{\mathrm{\mathsf {effect}}}}\) in order to compute the authorized subgraph of a given graph.
Definition 10
(Positive Subgraph). Given a policy \(P=(\mathscr {A},\mathsf {ch})\) and a graph G, the positive subgraph of G is defined by \({G}^{{{\mathrm{+}}}}_{P} = \{ t \in G \mid ({{\mathrm{\mathsf {effect}}}}\, {{\mathrm{\circ }}}\, \mathsf {ch})(\mathsf {ar}(G,\mathscr {A})(t))={{\mathrm{+}}}\}\).
Example 8
Let us consider the policy \(P=(\mathscr {A},\mathsf {ch})\) defined in Example 7 and the graph \(G_0\) of Fig. 1. Regarding the triple \(t_{8} =( \!:\!alice \,; \!:\!admitted \,; \!:\!onc )\), \(\mathsf {ch} (\{a_5, a_6, a_9\}) = a_5\) since \(a_5\) is the first among the applicable authorizations in Table 1, and its effect is −; we deduce that \(t_{8} \not \in {G_0}^{{{\mathrm{+}}}}_{P}\). By applying a similar reasoning on all triples in \(G_0\), we obtain \({G_0}^{{{\mathrm{+}}}}_{P} = \{t_{1},t_{4},t_{5},t_{6} \}\).
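Example 8 can be replayed mechanically. In the sketch below, the applicable-authorization sets are read off the bitsets of Table 2, the effects follow the Table 1 descriptions, and the names a1..a9 reflect the Table 1 order (our naming assumption).

```python
EFFECT = {"a1": "+", "a2": "-", "a3": "+", "a4": "+", "a5": "-",
          "a6": "+", "a7": "+", "a8": "-", "a9": "-"}
ORDER = [f"a{i}" for i in range(1, 10)]

def ch(applicable):
    """First applicable authorization in the Table 1 order."""
    return min(applicable, key=ORDER.index)

def positive_subgraph(ar_map):
    """G+ = { t | (effect ∘ ch)(ar(G, A)(t)) = '+' }."""
    return {t for t, auths in ar_map.items() if EFFECT[ch(auths)] == "+"}

ar_g0 = {  # ar(G0, A)(t) for t1..t9, read off Table 2's bitsets
    "t1": {"a7", "a8", "a9"}, "t2": {"a9"}, "t3": {"a9"}, "t9": {"a9"},
    "t4": {"a1", "a9"}, "t5": {"a3", "a9"}, "t6": {"a4", "a9"},
    "t7": {"a2", "a8", "a9"}, "t8": {"a5", "a6", "a9"},
}
```

Running `positive_subgraph(ar_g0)` reproduces \({G_0}^{{{\mathrm{+}}}}_{P} = \{t_{1},t_{4},t_{5},t_{6}\}\) from Example 8.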
5 Policy Enforcement
To enforce the AC4RDF model, we use an annotation approach which materializes the applicable authorizations in an annotated graph denoted by \(G^\mathscr {A} \). The latter is computed once and for all at design time. The subjects’ queries are evaluated over the annotated graph with respect to their assigned authorizations. In the following, we show how the base graph triples are annotated and how the subjects’ queries are evaluated.
5.1 Graph Annotation
From a conceptual point of view, an annotated triple can be represented by adding a fourth component to a triple, hence obtaining a so-called quad. From a physical point of view, the annotation can be stored in the graph name of the SPARQL dataset (Definition 5). To annotate the base graph, we use the graph name IRI of the dataset to store a bitset representing the applicable authorizations of each triple. First we need a bijective function \({{\mathrm{\mathsf {authToBs}}}}\) which maps a set of authorizations to an IRI representing its bitset. Authorizations are simply mapped to their position in the syntactical order of authorization definitions. In other words, given an authorization \(a_i\) and a set of authorizations \(\mathscr {A} _s\) to be mapped, the i-th bit is set to 1 in the generated bitset if \(a_i \in \mathscr {A} _s\). \({{\mathrm{\mathsf {authToBs}}}}^{\mathsf {-1}}\) is the inverse function of \({{\mathrm{\mathsf {authToBs}}}}\).
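A sketch of \({{\mathrm{\mathsf {authToBs}}}}\) and its inverse, under the assumption that authorizations are named a1..a9 in the Table 1 order and that the bitset string is read left to right:

```python
ORDER = [f"a{i}" for i in range(1, 10)]   # a1 .. a9, in Table 1 order

def auth_to_bs(auth_set):
    """authToBs: bit i (from the left) is 1 iff a_i belongs to the set."""
    return "".join("1" if a in auth_set else "0" for a in ORDER)

def bs_to_auths(bs):
    """authToBs^-1: decode a bitset string back into a set of authorizations."""
    return {a for a, bit in zip(ORDER, bs) if bit == "1"}
```

Encoding \(\{a_5, a_6, a_9\}\) (the applicable authorizations of \(t_8\)) gives the bitset 000011001 found in Table 2, and decoding round-trips.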
Example 9
Consider the policy \(P=(\mathscr {A},\mathsf {ch})\) defined in Example 7 and the graph \(G_0\) of Fig. 1. \({{\mathrm{\mathsf {authToBs}}}}(\mathsf {ar}(G_0, \mathscr {A})(t_{8})) = {{\mathrm{\mathsf {authToBs}}}}(\{a_5, a_6, a_9\})\) is the IRI representing the bitset 000011001.
Now we are ready to define the dataset representing the annotated graph.
Definition 11
(Annotated Graph). Given a graph G and a set of authorizations \(\mathscr {A} \), the annotated graph is the dataset \(G^\mathscr {A} = \{ \langle {{\mathrm{\mathsf {authToBs}}}}(A_i), G_i \rangle \}_{i \in 1..n}\) where \(A_1, \ldots , A_n\) are the distinct values taken by \(\mathsf {ar}(G, \mathscr {A})\) and \(G_i = \{ t \in G \mid \mathsf {ar}(G, \mathscr {A})(t) = A_i \}\) groups the triples sharing the same set of applicable authorizations \(A_i\).
Definition 11 defines how to annotate the base graph G given a set of authorizations. The following Lemma 1 ensures that \(G^\mathscr {A} \) forms a partition of the base graph G.
Lemma 1
\(\forall i,j \in 1..n : i \ne j \implies G_i \cap G_j = \emptyset \)
\(\bigcup _{i \in 1..n} G_i = G \)
5.2 Subject’s Query Evaluation
The subject is the entity requesting access to the triple store. The determination of the objects accessible by the subject could be based on the subject identity, role or attributes. Given a global set of authorizations \(\mathscr {A} \) we suppose that the subset \(\mathcal {A}_s\) assigned to the subject is known in advance. The upstream authentication and determination of the authorizations assigned to the subjects is out of the scope of this paper.
Following Definition 10, given a global policy authorization set \(\mathscr {A} \), the positive subgraph of a subject having \(\mathcal {A}_s \subseteq \mathscr {A} \) as applicable authorizations, is given by the following: \({G}^{{{\mathrm{+}}}}_{s} = \{ t \in G \mid ({{\mathrm{\mathsf {effect}}}}\, {{\mathrm{\circ }}}\, \mathsf {ch})(\mathsf {ar}(G,\mathcal {A}_s)(t))={{\mathrm{+}}}\}\). Since we materialized the set of applicable authorizations in \(G^\mathscr {A} \), we need to define the subject’s positive subgraph from the graph annotation, more precisely from \(\mathsf {ar}(G,\mathscr {A})\). The following lemma shows that \(\mathsf {ar}(G,\mathcal {A}_s)\) can be computed from \(\mathcal {A}_s \) and \(\mathsf {ar}(G,\mathscr {A})\).
Lemma 2
Let G be a graph, \(\mathscr {A} \) a set of authorizations and \(\mathcal {A}_s \subseteq \mathscr {A} \). Then for every triple \(t\in G\): \(\mathsf {ar}(G,\mathcal {A}_s)(t) = \mathsf {ar}(G,\mathscr {A})(t) \cap \mathcal {A}_s \).
Similarly to the triples, subjects are assigned bitsets representing the authorizations applicable to them. If a subject’s authorization set is \(\mathcal {A}_s \), then she/he is assigned a bitset ubs where the i-th bit is set to 1 if \(a_i \in \mathcal {A}_s \).
Example 10
Consider the set of authorizations \(\mathscr {A} \) in Table 1. Eve is a nurse who can see information about patients having tumors (\(a_1\)) and which service they are admitted to (\(a_6\)). She is denied anything else (\(a_9\)). Her assigned bitset is 100001001 in Table 2. Dave belongs to the administrative staff; he can access doctors’ service assignments (\(a_3\)) and the patients they treat (\(a_4\)). He is denied anything else (\(a_9\)). His assigned bitset is 001100001 in Table 2.
Once the graph is annotated, it is made available to the subjects with a filter function which prunes out the inaccessible triples given the subject’s authorization set. In other words, the \({{\mathrm{\mathsf {filter}}}}\) function returns the subject’s positive subgraph by applying the \(\mathsf {ch} \) function to the subject’s applicable authorizations \(\mathsf {ar}(G,\mathcal {A}_s)(t)\). We showed in Lemma 2 that this subset can be obtained from the applicable authorizations in \(G^\mathscr {A} \) by computing a bitwise logical and (denoted by &) between the subject’s and the triples’ bitsets.
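Putting the pieces together, the filter function can be sketched over the annotated graph of Table 2. The authorization names and effects follow Table 1 (under our a1..a9 naming), and the string encoding of bitsets is illustrative.

```python
ORDER = [f"a{i}" for i in range(1, 10)]
EFFECT = {"a1": "+", "a2": "-", "a3": "+", "a4": "+", "a5": "-",
          "a6": "+", "a7": "+", "a8": "-", "a9": "-"}

def bs_and(b1, b2):
    """Bitwise AND of two bitset strings of equal length."""
    return "".join("1" if x == "1" == y else "0" for x, y in zip(b1, b2))

def decode(bs):
    return {a for a, bit in zip(ORDER, bs) if bit == "1"}

def ch(auths):
    return min(auths, key=ORDER.index)

def filter_graph(annotated, ubs):
    """Subject's positive subgraph: keep t iff (effect∘ch)(decode(u & ubs)) = '+'."""
    return {t for u, ts in annotated.items() for t in ts
            if decode(bs_and(u, ubs)) and EFFECT[ch(decode(bs_and(u, ubs)))] == "+"}

annotated_g0 = {  # u -> triples, as in Table 2
    "000000111": {"t1"}, "000000001": {"t2", "t3", "t9"},
    "100000001": {"t4"}, "001000001": {"t5"}, "000100001": {"t6"},
    "010000011": {"t7"}, "000011001": {"t8"},
}
```

Filtering with Eve's bitset 100001001 yields \(\{t_4, t_8\}\) and with Dave's 001100001 yields \(\{t_5, t_6\}\), matching Example 11.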
Definition 12
(Filter Function). Given an annotated graph \(G^{\mathscr {A}}\) and a subject bitset ubs, \({{\mathrm{\mathsf {filter}}}}(G^{\mathscr {A}})(ubs) = \{ t \mid \langle u, t \rangle \in G^{\mathscr {A}} \wedge ({{\mathrm{\mathsf {effect}}}}\, {{\mathrm{\circ }}}\, \mathsf {ch})({{\mathrm{\mathsf {authToBs}}}}^{\mathsf {-1}}(u \mathbin {\& } ubs)) = {{\mathrm{+}}}\}\).
Once the subject’s positive subgraph is computed with \({{\mathrm{\mathsf {filter}}}}\), the subject’s query Q is then evaluated over it returning \(\llbracket {Q} \rrbracket _{{{\mathrm{\mathsf {filter}}}}(G^{\mathscr {A}})(ubs)}\) to the subject. The following theorem shows that \({{\mathrm{\mathsf {filter}}}}\) function applied to the annotated graph returns the subject’s positive subgraph^{3}.
Theorem 1
Let \(\mathscr {A}\) be a set of authorizations, \(P=(\mathscr {A} _s,\mathsf {ch})\) be the policy of subject s and ubs her/his associated bitset. If G is a graph and \(G^{\mathscr {A}}\) its annotated version, then \({{\mathrm{\mathsf {filter}}}}(G^{\mathscr {A}})(ubs) = {G}^{{{\mathrm{+}}}}_{s}\).
Example 11
Let us consider the policy \(P=(\mathscr {A},\mathsf {ch})\) of Example 7. Table 2 illustrates the annotated graph obtained from \(G_0\) shown in Fig. 1, as well as the two users of Example 10 with their assigned authorizations. The \({{\mathrm{\mathsf {filter}}}}\) function will compute the positive subgraph of Eve as follows: \({{\mathrm{\mathsf {filter}}}}(G_0^{\mathscr {A}})(100001001) = \{t_{4},t_{8} \}\). Similarly, Dave’s positive subgraph equals \(\{t_{5},t_{6} \}\).
Example of annotated graph and users’ bitsets

| \(G_0^{\mathscr {A}}\): u | G | Eve: ubs & u (ubs = 100001001) | Dave: ubs & u (ubs = 001100001) |
|---|---|---|---|
| 000000111 | {\(t_{1} \)} | 000000001 | 000000001 |
| 000000001 | {\(t_{2},t_{3},t_{9} \)} | 000000001 | 000000001 |
| 100000001 | {\(t_{4} \)} | 100000001 | 000000001 |
| 001000001 | {\(t_{5} \)} | 000000001 | 001000001 |
| 000100001 | {\(t_{6} \)} | 000000001 | 000100001 |
| 010000011 | {\(t_{7} \)} | 000000001 | 000000001 |
| 000011001 | {\(t_{8} \)} | 000001001 | 000000001 |
6 Implementation
Our system is implemented using the Jena Java API on top of the Jena TDB^{4} (quad) store. Apache Jena is an open source Java framework which provides an API to manage RDF data. ARQ^{5} is a SPARQL query engine for Jena which allows querying and updating RDF models through the SPARQL standards. ARQ supports custom aggregation, GROUP BY, FILTER functions and path queries. Jena TDB is a native RDF store which allows storing and querying RDF quads.
To generate \(G^{\mathscr {A}}\), the dataset of annotated triples, we use SPARQL CONSTRUCT queries to obtain authorization scopes (see Definition 8). An authorization \(a = (e, h, b)\) is transformed into the query \(\texttt {CONSTRUCT} \; \{h\} \; \texttt {WHERE} \; \{b\}\). We use an in-memory hash map in which we store the ids of the triples and the corresponding bitset. For every authorization \(a_i\), a CONSTRUCT query \(Q_{a_i}\) is run over the raw dataset, and the resulting triples are added/updated in the hash map with bit i set to 1. Once the hash map is computed, it is written into a dataset which represents \(G^{\mathscr {A}}\). Note that we could have used the dataset directly instead of a hash map, but this would be time-consuming due to the high number of disk accesses. In case of a number of triples too high to fit in memory, we could use a hybrid approach by loading the triples partially, but this extension is left for future work.
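The annotation pipeline can be sketched as follows, abstracting the per-authorization CONSTRUCT step as a precomputed scope set (in the real system each scope is obtained by running the corresponding CONSTRUCT query over the raw dataset with Jena):

```python
def annotate(graph, scopes):
    """Build the triple -> bitset map of G^A.
    scopes[i] is the set of triples authorization a_{i+1} applies to
    (stand-in for the result of one CONSTRUCT query per authorization)."""
    n = len(scopes)
    bits = {t: ["0"] * n for t in graph}
    for i, scope in enumerate(scopes):
        for t in scope:
            if t in bits:            # ignore constructed triples outside G
                bits[t][i] = "1"
    return {t: "".join(b) for t, b in bits.items()}

# Toy run over two triples of G_0, with 9 authorizations in Table 1 order:
graph = {"t4", "t8"}
scopes = [{"t4"}, set(), set(), set(), {"t8"}, {"t8"},
          set(), set(), {"t4", "t8"}]
```

On this toy input, `annotate` assigns t4 the bitset 100000001 and t8 the bitset 000011001, matching the corresponding rows of Table 2.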
During query evaluation, on-the-fly filtering is applied to the accessed triples. Jena TDB provides a low-level quad filter hook^{6} that we use for our implementation. For each accessed quad, let u be the quad’s graph IRI, t its triple and ubs the subject’s bitset. A bitwise logical and is performed between (the bitset represented by) u and ubs. The \(\mathsf {ch} \) function is then applied to the authorizations obtained by \({{\mathrm{\mathsf {authToBs}}}}^{\mathsf {-1}}\) in order to allow or deny access to t. If t is allowed, then it is transmitted to the ARQ engine to be used by query Q. Otherwise, it is hidden from the ARQ engine. An in-memory cache is used to map quad graph IRIs to grant/deny decisions in order to speed up the filtering process.
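Since the grant/deny decision depends only on the quad's graph-IRI bitset, it can be memoized, mirroring the in-memory decision cache described above. In this sketch `decide` stands for the composition \({{\mathrm{\mathsf {effect}}}}\circ \mathsf {ch} \circ {{\mathrm{\mathsf {authToBs}}}}^{\mathsf {-1}}\), bitsets are plain integers, and the two-authorization example is purely illustrative.

```python
def make_quad_filter(ubs, decide):
    """Per-quad filter: decide(ubs & quad_bs) -> bool, cached per bitset."""
    cache = {}
    def allow(quad_bs):
        if quad_bs not in cache:                 # decide once per distinct bitset
            cache[quad_bs] = decide(ubs & quad_bs)
        return cache[quad_bs]
    return allow

# Toy decision under 2 authorizations: the high bit grants, the low bit denies.
allow = make_quad_filter(0b10, lambda bits: bool(bits & 0b10))
```

Repeated lookups for the same graph IRI hit the cache, so `decide` runs at most once per distinct bitset regardless of how many quads carry it.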
6.1 Experiments
Summary of notations

| In/Out | Symbol | Description |
|---|---|---|
| in | \(|G|\) | Size of the LUBM dataset |
| in | \(|\mathscr {A} |\) | Number of authorizations |
| in | \(|G^+|_{|G|}\) | Positive subgraph size w.r.t. raw dataset size |
| in | \(|\mathscr {A} _s|\) | Number of authorizations assigned to the subject |
| in | \(Q_s\) | LUBM test query |
| out | \(t_{A}\) | Time to build \(G^\mathscr {A} \) in memory |
| out | \(t_{W}\) | Time to write \(G^\mathscr {A} \) to disk |
| out | \(t_{G^+}\) | Time to evaluate Q on materialized \(G^+\) |
| out | \(t_{G^\mathscr {A}}\) | Time to evaluate Q on \(G^\mathscr {A} \) |
| out | \(t_{G}\) | Time to evaluate Q on (raw) G |
Static Performance. We split the time needed to compute \(G^\mathscr {A} \) into the time required to build it and the time required to write it. The time to build, in memory, the authorization bitset \(\mathsf {ar}(G,t)\) associated with each triple \(t\in G\) is referred to as \(t_{A} \) in Table 3. The time to write the annotated graph \(G^\mathscr {A} \) from memory to the quad store is referred to as \(t_{W} \) in Table 3. Figure 2 shows \(t_{A} \) and \(t_{W} \) with \(|\mathscr {A} | \) set to 100 authorizations. Figure 3 shows \(t_{A} \) and \(t_{W} \) with \(|G| \) set to 1,591 k triples. As each authorization is mapped to a SPARQL CONSTRUCT query, the results show that \(t_{A} \) grows linearly as \(|G| \) or \(|\mathscr {A} | \) grows. The annotation time is not negligible, but we argue that it is not an issue: \(G^\mathscr {A} \) is computed only once, as long as \(\mathscr {A} \) is not modified. The ratio \(t_{A}/ t_{W} \) is about 3.4 on average across the values of \(|G| \) in Fig. 2. In other words, for 100 authorizations, our method is amortized if the sum of the sizes of the positive subgraphs over all subjects is approximately 5 times greater than the number of triples in the base graph. Figure 3 shows that \(t_{A} \) grows linearly as \(|\mathscr {A} | \) grows. However, as expected, the results show that \(t_{W} \) is independent of \(|\mathscr {A} | \): the overhead incurred by the growing size of the bitsets is negligible for \(|\mathscr {A} | \in \{50,100,150,200\}\). On average, the annotated graph \(G^\mathscr {A} \) requires 50 % more disk space than G.
Dynamic Performance. To evaluate the performance of our solution at runtime, we compare our approach to two extreme methods. Each method computes the answer of a query Q over the positive subgraph \(G^+\) obtained by filtering a base graph G according to a set \(\mathscr {A} \) of authorizations.
The first extreme (naive) method gives an upper bound on the overhead incurred by the filtering process. In the post-processing approaches, access control consists of two steps: (1) compute the full answer Q(G) and (2) filter out the denied triples from Q(G) as a post-processing step. This method avoids duplication of the base graph G at the price of a high overhead at runtime. In our experiments, we considered step (1) only, by computing the full answer Q(G). We refer to this method as \(t_{G} \) in Table 3. The second extreme method gives a lower bound on the overhead incurred by the filtering process. The idea is to materialize \(G^+\) for each user profile and then compute \(Q(G^+)\). We refer to this method as \(t_{G^+} \) in Table 3. This method avoids the filtering post-process at the price of massive duplication and storage overhead. In contrast, our approach, namely \(t_{G^\mathscr {A}} \) in Table 3, is a trade-off between these two extremes: it needs some static computation while offering competitive runtime performance. Our results are shown in Fig. 4 for varying sizes of \(|G| \) with \(|\mathscr {A} | \) and \(|\mathscr {A} _s| \) set to 100, and \(|G^+|_{|G|} \) set to 40 %. The subject query \(Q_s\) is set to the worst case, namely the select-all query. The key insight from these experiments is that the overhead is independent of \(|G|\) and is about 50 %.
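The two baseline strategies can be contrasted with a toy sketch; the names and the list-based query are hypothetical, and `query` here stands for a simple selection, under which both strategies return the same answer.

```python
def post_process(query, graph, allowed):
    # naive baseline (t_G plus step (2)): evaluate on the full graph,
    # then filter the denied triples out of the answer
    return [t for t in query(graph) if allowed(t)]

def materialized(query, graph, allowed):
    # other extreme (t_{G+}): materialize the positive subgraph per
    # profile ahead of time, then evaluate the query on it
    positive = [t for t in graph if allowed(t)]
    return query(positive)
```

The first pays the filtering cost on every query; the second pays storage and duplication per profile up front. The annotated-graph approach keeps a single copy of the data and moves the filtering below the query engine.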
Another advantage of our approach is its independence of the number of authorizations, both in the policy and assigned to the subject. In Fig. 5 we vary the number of policy authorizations (\(|\mathscr {A} | \)) with \(|G| \) set to 1,591 k triples and \(Q_s\) to the select-all query. The experiments show a constant overhead as \(|\mathscr {A} | \) changes.
Regarding \(|G^+|_{|G|}\), the size of the positive subgraph relative to the size of the raw dataset, the experiments in Fig. 6 show that the query answering time \(t_{G^\mathscr {A}} \) grows linearly as \(|G^+|_{|G|}\) grows, with \(|G| \) fixed to 1,591 k, \(|\mathscr {A} | \) and \(|\mathscr {A} _s| \) fixed to 100, and \(Q_s\) being the select-all query. This shows that the overhead w.r.t. a materialized \(Q(G^+)\) does not depend on the size of the positive subgraph. Note that \(t_G\) does not vary since we did not consider the filtering step of the post-processing approaches; otherwise it would grow linearly as \(|G^+|_{|G|}\) grows.
In Fig. 7 we run experiments on our system with a subset of the LUBM test queries used by [3], with \(|\mathscr {A} | \) and \(|\mathscr {A} _s| \) set to 100, and \(|G^+|_{|G|} \) set to 40 %. We computed the LUBM query evaluation times and repeated the experiments 100 times. Q1 and Q3 are more complex queries, with a high number of initial triples associated with their triple patterns but a quite small final number of results (28 and 0, respectively). Figure 7 shows that the time to evaluate query Q3 in the presence of the filter, \(t_{G^\mathscr {A}} \), is smaller than the evaluation time over the materialized positive subgraph, \(t_{G^+} \). The reasons could be the empty result of Q3 or different execution plans. For the remaining queries, the overhead was between 6 and 40 %.
7 Conclusion
In this paper, we proposed an enforcement framework for the access control model for RDF that we defined in [14]. We used an annotation approach where the base graph is annotated at policy design time. Each triple is annotated with a bitset representing its applicable authorizations. A subject's queries are evaluated over the subject's positive subgraph, constructed using the subject's bitset and the triples' bitsets. The experiments showed that the annotation time is not negligible, but we argue that this is not an issue since the operation is performed once and for all at policy design time. We showed that the overhead of subject query evaluation is independent of the size of the base graph, and is about 50 %. Moreover, we showed that our approach is independent of the number of policy authorizations as well as of the query language used, in contrast to query rewriting techniques.
Ongoing work on this platform includes the design of an algorithm for the incremental update of \(G^\mathscr {A} \) when G is modified, high-level optimizations for the construction of \(G^\mathscr {A} \) using the partial order between authorizations induced by basic graph pattern containment and new empirical evaluations on both synthetic and real-life data.
References
- 1. Abel, F., Coi, J.L., Henze, N., Koesling, A.W., Krause, D., Olmedilla, D.: Enabling advanced and context-dependent access control in RDF stores. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 1–14. Springer, Heidelberg (2007)
- 2. Angles, R., Gutiérrez, C.: Survey of graph database models. ACM Comput. Surv. 40(1), 1–39 (2008)
- 3. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “bit” loaded: a scalable lightweight join query processor for RDF data. In: WWW, pp. 41–50 (2010)
- 4. Berners-Lee, T.: Linked data - design issues (2006). https://www.w3.org/DesignIssues/LinkedData.html
- 5. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009)
- 6. Costabello, L., Villata, S., Delaforge, N., et al.: Linked data access goes mobile: context-aware authorization for graph stores. In: LDOW - 5th WWW Workshop (2012)
- 7. Flouris, G., Fundulaki, I., Michou, M., Antoniou, G.: Controlling access to RDF graphs. In: Berre, A.J., Gómez-Pérez, A., Tutschku, K., Fensel, D. (eds.) FIS 2010. LNCS, vol. 6369, pp. 107–117. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15877-3_12
- 8. Hayes, P.J., Patel-Schneider, P.F.: RDF 1.1 semantics. W3C recommendation (2014). http://www.w3.org/TR/rdf11-mt/
- 9. Jain, A., Farkas, C.: Secure resource description framework: an access control model. In: SACMAT, pp. 121–129. ACM (2006)
- 10. Lopes, N., Kirrane, S., Zimmermann, A., Polleres, A., Mileo, A.: A logic programming approach for access control over RDF. In: ICLP, pp. 381–392 (2012)
- 11. Papakonstantinou, V., Michou, M., Fundulaki, I., Flouris, G., Antoniou, G.: Access control for RDF graphs using abstract models. In: SACMAT, pp. 103–112 (2012)
- 12. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 1–45 (2009)
- 13. Reddivari, P., Finin, T., Joshi, A.: Policy-based access control for an RDF store. In: WWW, pp. 78–81 (2005)
- 14. Sayah, T., Coquery, E., Thion, R., Hacid, M.-S.: Inference leakage detection for authorization policies over RDF data. In: Samarati, P. (ed.) DBSec 2015. LNCS, vol. 9149, pp. 346–361. Springer, Heidelberg (2015). doi:10.1007/978-3-319-20810-7_24