1 Introduction

The Linked Data infrastructure could potentially be used not only to distributedly share public data, but also to selectively share data, perhaps of a sensitive nature (e.g., personal data, health data, financial data, etc.), with specific individuals or organisations (i.e., closed data). In order to realise this vision, we must first extend the existing Linked Data infrastructure with suitable security mechanisms. More specifically, encryption is needed to protect data in case the server is compromised, while access control is needed to ensure that only authorised individuals can access specific data. Apart from the need to protect data, robustness in terms of usability, performance, and scalability is a major consideration.

However, current encryption techniques for RDF are still very limited, especially with respect to the flexible maintenance and querying of encrypted data in light of user access control policies. Initial partial encryption techniques [16, 17] focus on catering for both plain and encrypted data in the same representation and how to incorporate the metadata necessary for decryption. More recently, [21] proposed the generation of multiple ciphertexts per triple (i.e. each triple is encrypted multiple times depending on whether or not access to the subject, predicate and/or object is restricted) and the distribution of several keys to users. Although finer-grained access control is supported, the maintenance of multiple ciphertexts (i.e. encrypted triples) and keys presents scalability challenges. Additionally, such an approach, or likewise term-based encryption of RDF graphs, means that the structure of parts of the graph that should not be accessible could potentially be recovered, thus posing a security risk (cf. for instance [32]).

Beyond RDF, novel cryptography mechanisms have been developed that enable the flexible specification and enforcement of access policies over encrypted data. Predicate-based Encryption (PBE) [22] – which we refer to as Functional Encryption (FE) in order to avoid confusion with RDF predicates – enables searching over encrypted data, mainly for keywords or the conjunction of keyword queries, while alleviating the re-encryption burden associated with adding additional data.

Herein, we extend recent findings on FE to RDF, and demonstrate how FE can be used for fine-grained access control based on triples patterns over encrypted RDF datasets. Summarising our contributions, we: (i) adapt functional encryption to RDF such that it is possible to enforce access control over encrypted RDF data in a self-enforcing manner; (ii) demonstrate how encryption keys based on triple patterns can be used to specify flexible access control for Linked Data sources; and (iii) propose and evaluate indexing strategies that enhance query performance and scalability. Experiments show reasonable loading and query performance overheads with respect to traditional, non-encrypted data retrieval. The remainder of the paper is structured as follows: We discuss related work and potential alternatives to our proposal in Sect. 2. The details of our specific approach and optimisations are presented in Sects. 3 and 4 respectively, and evaluated in Sect. 5. Finally, we conclude and outline directions for future work in Sect. 6.

2 Related Work

When it comes to access control for RDF, broadly speaking researchers have focused on representing existing access control models and standards using semantic technology; proposing new access control models suitable for open, heterogeneous and distributed environments; and devising languages and frameworks that can be used to facilitate access control policy specification and maintenance. Kirrane et al. [23] provide a comprehensive survey of existing access control proposals for RDF. Unlike access control, encryption techniques for RDF has received very little attention to date. Giereth [17] demonstrate how public-key encryption techniques can be used to partially encryption RDF data represented using XML. While, Giereth [17] and Gerbracht [16] propose strategies for combining partially encrypted RDF data with the metadata that is necessary for decryption. Kasten et al. [21] propose a framework that can be used to query encrypted data. In order to support SPARQL queries based on triple patterns each triple is encrypted eight times according to the eight different binding possibilities. Limitations of the approach include the blowup associated with maintaining eight ciphers per triple and the fact that the structure of the graph is still accessible.

Searchable Symmetric Encryption (SSE) [9] has been extensively applied in database-as-a-service and cloud environments. SSE techniques focus on the encryption of outsourced data such that an external user can encrypt their query and subsequently evaluate it against the encrypted data. More specifically, SSE extracts the key features of a query (the data structures that allow for its resolution) and encrypts them such that it can be efficiently evaluated on the encrypted data. Extensive work has been done in basic SSE, which caters for a single keyword [6]. Recent improvements have been proposed to handle conjunctive search over multiple keywords [5], and to optimise the resolution to cater for large scale data in the presence of updates [4, 20, 30]. However, all of these works focus on keyword-based retrieval, whereas structured querying (such as SPARQL) over encrypted RDF datasets would require (at least) an unrestricted set of triple query patterns. In contrast, Fully Homomorphic Encryption (FHE) [15] allows any general circuit/computation over encrypted data, however it is prohibitively slow for most operations [7, 28]. Thus, practical, encryption databases such as CryptDB [28] make use of lighter forms of encryption that still cater for computations (such as sums) over the encrypted data [27], at the cost of different vulnerability/feasibility trade-offs. Recently, predicate encryption [22], whereby predicates correspond to the evaluation of disjunctions, polynomial equations and inner products, enables security in light of unrestricted queries. Predicate encryption has a proven track record of efficiency in terms of conjunctive equality, range and subset queries.

The solution we propose builds on an existing work that defines access control policies based on RDF patterns that are in turn enforced over RDF datasets [23]. While existing proposals enforce access control over plain RDF data via data filtering (i.e., a query is executed against a dataset which is generated by removing the unauthorised data) or query rewriting (i.e., a query is updated so that unauthorised data will not be returned and subsequently executed over the unmodified dataset), we demonstrate how functional encryption can be used to enforce access control over encrypted RDF data in a self-enforcing manner (i.e., without the need for either data filtering or query rewriting nor a trusted mediator). Unlike previous approaches we store one cipher per triple and employ indexing strategies based on secure hashes (cf. PBKDF2 [19]) that can be used for efficient querying of encrypted RDF. In addition, we propose a mechanism to obfuscate the graph structure with real indexes and dummy ciphers that cannot be decrypted, making the dummy hashes and ciphers indistinguishable from real hashes and ciphers.

3 Secure and Fine-Grained Encryption of RDF

Common public-key encryption schemes usually follow an all-or-nothing approach (i.e., given a particular decryption key, a ciphertext can either be decrypted or not) which in turn requires users to manage a large amount of keys, especially if there is a need for more granular data encryption [2]. Recent advances in public-key cryptography, however, have led to a new family of encryption schemes called Functional Encryption (FE) which addresses aforementioned issue by making encrypted data self-enforce its access restrictions, hence, allowing for fine-grained access over encrypted information. In a functional encryption scheme, each decryption key is associated with a boolean function and each ciphertext is associated with an element of some attribute space \(\varSigma \); a decryption key corresponding to a boolean function f is able to decrypt a particular ciphertext associated with \(I\in \varSigma \) iff \(f(I) = 1\). A functional encryption scheme is defined as a tuple of four distinct algorithms (Setup, Enc, KeyGen, Dec) such that:  

Setup :

is used for generating a master public and master secret key pair.

Enc :

encrypts a plaintext message m given the master public key and an element \(I\in \varSigma \). It returns a ciphertext c.

KeyGen :

takes as input the master secret key and generates a decryption key (i.e., secret key) \(SK_f\) for a given boolean function f.

Dec :

takes as input a secret key \(SK_f\) and a ciphertext c. It extracts I from c and computes f(I).

 

3.1 A Functional Encryption Scheme for RDF

While there exist various different approaches for realising functional encryption schemes, we build upon the work of Katz et al. [22] in which functions correspond to the computation of inner-products over \(\mathbb {Z}_N\) (for some large integer N). In their construction, they use \(\varSigma = \mathbb {Z}^n_N\) as set of possible ciphertext attributes of length n and \(\mathcal {F} = \{f_{\vec {x}} | \vec {x} \in \mathbb {Z}^n_N\}\) as the class of decryption key functions. Each ciphertext is associated with a (secret) attribute vector \(\vec {y} \in \varSigma \) and each decryption key corresponds to a vector \(\vec {x}\) that is incorporated into its respective boolean function \(f_{\vec {x}}\in \mathcal {F}\) where \(f_{\vec {x}}(\vec {y}) = 1\) iff \(\sum ^n_{i=1} y_ix_i = 0\).

In the following, we discuss how this encryption scheme can be utilised (i.e., its algorithms adoptedFootnote 1) to provide fine-grained access over encrypted RDF triples. Thus, allow for querying encrypted RDF using triple patterns such that a particular decryption key can decrypt all triples that satisfy a particular triple pattern (i.e., one key can open multiple locks). For example, a decryption key generated from a triple pattern (?,p,?) should be able to decrypt all triples with p in the predicate position.

Encrypting RDF Triples (Enc). To be able to efficiently encrypt large RDF datasets, we adopt a strategy commonly used in public-key infrastructures for securely and efficiently encrypting large amounts of data called Key Encapsulation [24]. Key encapsulation allows for secure but slow asymmetric encryption to be combined with simple but fast symmetric encryption by using asymmetric encryption algorithms for deriving a symmetric encryption key (usually in terms of a seed) which is subsequently used by encryption algorithms such as AES [11] for the actual encryption of the data. We illustrate this process in Fig. 1.

Fig. 1.
figure 1

Process of encrypting an RDF triple t.

Thus, to encrypt an RDF triple \(t=(s,p,o)\), we first compute its respective triple vector (i.e., attribute vector) \(\vec {y}_{t}\) and functionally encrypt (i.e., compute Enc as defined in [22]) a randomly generated seed \(m_{t}\) using \(\vec {y}_{t}\) as the associated attribute vector. Triple vector \(\vec {y_{t}}\) where \(\vec {y_{t}}=(y_{s},y'_{s},y_{p},y'_{p},y_{o},y'_{o})\) for triple t is constructed as follows, where \(\sigma \) denotes a mapping function that maps a triple’s subject, predicate, and object value to elements in \(\mathbb {Z}_N\):

$$\begin{aligned} y_{l} := -r\cdot \sigma (l), y'_l := r, \text {with}\, l\in \{s,p,o\} \,\,\text {and random}\,\, r\in \mathbb {Z}_N \end{aligned}$$

Table 1 illustrates the construction of a triple vector \(\vec {y_{t}}\) based on RDF triple t.

Table 1. Computing the triple vector \(\vec {y_{t}}\) of an RDF triple t.

We use AES to encrypt the actual plaintext triple t with an encryption key derivable from our previously generated seed \(m_{t}\) and return both, the resulting AES ciphertext of t denoted by \(\hat{t}\) and the ciphertext of the seed denoted by \(\hat{m}_{t}\) as final ciphertext triple \(c_t = \langle \hat{t},\hat{m}_t\rangle \).

Generating Decryption Keys (KeyGen). As outlined above, decryption keys must be able to decrypt all triples that satisfy their inherent triple pattern (i.e., one query key can open multiple locks). In order to compute a decryption key based on a triple pattern \(tp=(s,p,o)\) with sp,  and o either bound or unbound, we define its corresponding vector \(\vec {x}\) as \(\vec {x}_{tp}=(x_s,x'_s,x_p,x'_p,x_o,x'_o)\) with:

$$\begin{aligned} if \,l \,is \,bound: \, x_l := 1, x'_l := \sigma (l)\text {, with }l\in \{s,p,o\}\\ if \,l \,is \,not \,bound: \, x_l := 0, x'_l := 0\text {, with }l\in \{s,p,o\} \end{aligned}$$

Again, \(\sigma \) denotes a mapping function that maps a triple pattern’s subject, predicate, and object value to elements in \(\mathbb {Z}_N\). Table 2 illustrates the construction of a query vector \(\vec {x}_{tp}\) that corresponds to a triple pattern tp.

Table 2. Computing the query vector \(\vec {x}_{tp}\) that corresponds to a triple pattern tp

Decryption of RDF Triples (Dec). To verify whether an encrypted triple can be decrypted with a given decryption key, we compute the inner-product of their corresponding triple vector \(\vec {y}_{t}\) and query vector \(\vec {x}_{tp}\), with \(t=(s_t,p_t,o_t)\) and \(tp=(s_{tp},p_{tp},o_{tp})\):

$$\begin{aligned} \vec {y}_{t}\cdot \vec {x}_{tp}=y_{s_t} x_{s_{tp}}+y'_{s_t}x'_{s_{tp}}+y_{p_t}x_{p_{tp}}+y'_{p_t}x'_{p_{tp}}+y_{o_t}x_{o_{tp}}+y'_{o_t}x'_{o_{tp}} \end{aligned}$$

Only when \(\vec {y}_{t}\cdot \vec {x}_{tp}=0\) is it possible to decrypt the encrypted seed \(\hat{m}_{t}\), hence the corresponding symmetric AES key can be correctly derived and the plaintext triple t be returned. Otherwise (i.e., \(\vec {y}_{t}\cdot \vec {x}_{tp}\ne 0\)), an arbitrary seed \(m'\ne m_{t}\) is generated hence encrypted triple \(c_t\) cannot be decrypted [26].

4 Optimising Query Execution over Encrypted RDF

The secure data store holds all the encrypted triples, i.e. {\(c_{t_1}\), \(c_{t_2}\), \(\cdots \), \(c_{t_n}\)}, being n the total number of triples in the dataset. Besides assuring the confidentiality of the data, the data store is responsible for enabling the querying of encrypted data.

In the most basic scenario, since triples are stored in their encrypted form, a user’s query would be resolved by iterating over all triples in the dataset, checking whether any of them can be decrypted with a given decryption key. Obviously, this results in an inefficient process at large scale. As a first improvement one can distribute the set of encrypted triples among different peers such that decryption could run in parallel. In spite of inherent performance improvements, such a solution is still dominated by the available number of peers and the – potentially large – number of encrypted triples each peer would have to process. Current efficient solutions for querying encrypted data are based on (a) using indexes to speed up the decryption process by reducing the set of potential solutions; or (b) making use of specific encryption schemes that support the execution of operations directly over encrypted data [13]. Our solution herein follows the first approach, whereas the use of alternative and directly encryption mechanisms (such as homomorphic encryption [28]) is complementary and left to future work.

Fig. 2.
figure 2

3-Index approach for indexing and retrieval of encrypted triples.

In our implementation of such a secure data store, we first encrypt all triples and store them in a key-value structure, referred to as an EncTriples Index, where the keys are unique integer IDs and the values hold the encrypted triples (see Figs. 2 and 3 (right)). Note that this structure can be implemented with any traditional Map structure, as it only requires fast access to the encrypted value associated with a given ID. In the following, we describe two alternative approaches, i.e., one using three individual indexes and one based on Vertical Partitioning (VP) for finding the range of IDs in the EncTriples Index which can satisfy a triple pattern query. In order to maintain simplicity and general applicability of the proposed store, both alternatives consider key-value backends, which are increasingly used to manage RDF data [8], especially in distributed scenarios. It is also worth mentioning that we focus on basic triple pattern queries as (i) they are the cornerstone that can be used to build more complex SPARQL queries, and (ii) they constitute all the functionality to support the Triple Pattern Fragments [31] interface.

3-Index Approach. Following well-known indexing strategies, such as from CumulusRDF [25], we use three key-value B-Trees in order to cover all triple pattern combinations: SPO, POS and OSP Indexes. Figure 2 illustrates this organisation. As can be seen, each index consists of a Map whose keys are the securely hashed (cf. PBKDF2 [19]) subject, predicate, and object of each triple, and values point to IDs storing the respective ciphertext triples in the EncTriples Index.

Algorithm 1 shows the resolution of a (s,p,o) triple pattern query using the 3-Index approach. First, we compute the secure hashes h(s), h(p) and h(o) from the corresponding s, p and o provided by the user (Line 1). Our hash(spo) function does not hash unbounded terms in the triple pattern but treats them as a wildcard ‘?’ term (hence all terms will be retrieved in the subsequent range queries). Then, we select the best index to evaluate the query (Line 2). In our case, the SPO Index serves (s,?,?) and (s,p,?) triple patterns, the POS Index satisfies (?,p,?) and (?,p,o), and the OSP Index index serves (s,?,o) and (?,?,o). Both (s,p,o) and (?,?,?) can be solved by any of them. Then, we make use of the selected index to get the range of values where the given h(s), h(p), h(o) (or ‘anything’ if the wildcard ‘?’ is present in a term) is stored (Line 3). Note that this search can be implemented by utilising B-Trees [10, 29] for indexing the keys. For each of the candidate ID values in the range (Line 4), we retrieve the encrypted triple for such ID by searching for this ID in the EncTriples Index (Line 5). Finally, we proceed with the decryption of the encrypted triple using the key provided by the user (Line 6). If the status of such decryption is valid (Line 7) then the decryption was successful and we output the decrypted triples (Line 8) that satisfy the query.

figure a

Thus, the combination of the three SPO, POS and OSP Indexes reduces the search space of the query requests by applying simple range scans over hashed triples. This efficient retrieval has been traditionally served through tree-based map structures guaranteeing log(n) costs for searches and updates on the data, hence we rely on B-Tree stores for our practical materialisation of the indexes. In contrast, supporting all triple pattern combinations in 3-Index comes at the expense of additional space overheads, given that each (h(s),h(p),h(o)) of a triple is stored three times (in each SPO, POS and OSP Indexes). Note, however, that this is a typical scenario for RDF stores and in our case the triples are encrypted and stored just once (in EncTriples Index).

Fig. 3.
figure 3

Vertical Partitioning (VP) approach for indexing and retrieval of encrypted triples.

Vertical Partitioning Approach. Vertical partitioning [1] is a well-known RDF indexing technique motivated by the fact that usually only a few predicates are used to describe a dataset [14]. Thus, this technique stores one “table” per predicate, indexing (S,O) pairs that are related via the predicate. In our case, we propose to use one key-value B-Tree for each h(p), storing (h(s),h(o)) pairs as keys, and the corresponding ID as the value. Similar to the previous case, the only requirement is to allow for fast range queries on their map index keys. However, in the case of an SO index, traditional key-value schemes are not efficient for queries where the first component (the subject) is unbound. Thus, to improve efficiency for triple patterns with unbounded subject (i.e. (?,\(p_y\),\(o_z\)) and (?,?,\(o_z\))), while remaining in a general key-value scheme, we duplicate the pairs and introduce the inverse (h(o),h(s)) pairs. The final organisation is shown in Fig. 3 (left), where the predicate maps are referred to as Pred_h(p\(_1\)), Pred_h(p\(_2\)),..., Pred_h(p\(_n\)) Indexes. As depicted, we add "so" and "os" keywords to the stored composite keys in order to distinguish the order of the key.

Algorithm 2 shows the resolution of a (s,p,o) triple pattern query with the VP organisation. In this case, after performing the variable initialisation (Line 1) and the aforementioned secure hash of the terms (Line 2), we inspect the predicate term h(p) and select the corresponding predicate index (Line 3), i.e., Pred_h(p). Nonetheless, if the predicate is unbounded, all predicate indexes are selected as we have to iterate through all tables, which penalises the performance of such queries. For each predicate index, we then inspect the subject term (Lines 5–9). If the subject is unbounded (Line 5), we will perform a ("os",h(o),?) range query over the corresponding predicate index (Line 6), otherwise we execute a ("so",h(s),h(o)) range query. Note that in both cases the object could also be unbounded. The algorithm iterates over the candidates IDs (Lines 10-end) in a similar way to the previous cases, i.e., retrieving the encrypted triple from EncTriples Index (Line 11) and performing the decryption (Lines 12–14).

figure b

Overall, VP needs less space than the previous 3-Index approach, since the predicates are represented implicitly and the subjects and objects are represented only twice. In contrast, it penalises the queries with unbound predicate as it has to iterate through all tables. Nevertheless, studies on SPARQL query logs show that these queries are infrequent in real applications [3].

Protecting the Structure of Encrypted Data. The proposed hash-based indexes are a cornerstone for boosting query resolution performance by reducing the encrypted candidate triples that may satisfy the user queries. The use of secure hashes [19] assures that the terms cannot be revealed but, in contrast, the indexes themselves reproduce the structure of the underlying graph (i.e., the in/out degree of nodes). However, the structure should also be protected as hash-based indexes can represent a security risk if the data server is compromised. State-of-the-art solutions (cf., [13]) propose the inclusion of spurious information, that the query processor must filter out in order to obtain the final query result.

In our particular case, this technique can be adopted by adding dummy triple hashes into the indexes with a corresponding ciphertext (in EncTriples Index) that cannot be decrypted by any key, hence will not influence the query results. Such an approach ensures that both the triple hashes and their corresponding ciphertexts are not distinguishable from real data.

5 Evaluation

We develop a prototypical implementationFootnote 2 of the proposed encryption and indexing strategies. Our tool is written in Java and it relies on the Java Pairing-Based Cryptography Library (JPBC [12]) to perform all the encryption/decryption operations. While, we use MapDBFootnote 3 as the supporting framework for the indexes. We provide an interface that takes as input a triple pattern query and a query key, and outputs the results of the query.

We evaluate our proposal in two related tasks: (i) performance of the data loading (encryption and indexing) and (ii) performance of different user queries (query execution on encrypted data). In both cases, we compare our proposed 3-Index strategy w.r.t the vertical partitioning (VP) approach. Finally, we measure the performance overhead associated with query resolution, introduced by the secure infrastructure, by comparing its results with a counterpart non-secure triplestore. For a fair comparison, we implement the non-secure triplestore with similar 3-Index and VP indexing strategies, storing the RDF data in plain. The approaches are referred to as 3-Index-plain and VP-plain respectively.

Table 3. Statistical dataset description.

Table 3 describes our experimental datasets, reporting the number of triples, different subjects (\(|\mathrm{S}|\)), predicates (\(|\mathrm{P}|\)) and objects (\(|\mathrm{O}|\)), as well as the file size (in NT format). Note that there is no standard RDF corpus that can be used to evaluate RDF encryption approaches, hence we choose a diverse set of datasets that have been previously used to benchmark traditional RDF stores or there is a use case that indicates they could potentially benefit from a secure data store. On the one hand, we use the well-known Lehigh University Benchmark (LUBM [18]) data generator to obtain synthetic datasets of incremental sizes from 100K triples to 5M triples. On the other hand, we choose real-world datasets from different domains: Census represents the 2010 Australian census, where sensitive data must be preserved and users could have different partial views on the dataset; Jamendo lists music records and artists, where some data can be restricted to certain subscribers; and AEMET includes sensor data from weather stations in Spain, which is a real use case where the old data is public but the most recent data is restricted to particular users. Tests were performed on a computer with 2 x Intel Xeon E5-2650v2 @ 2.6 GHz (16 cores), RAM 171 GB, 4 HDDs in RAID 5 config. (2.7 TB netto storage), Ubuntu 14.04.5 LTS running on a VM with QEMU/KVM hypervisor. All of the reported (elapsed) times are the average of three independent executions.

Fig. 4.
figure 4

Time for loading (encrypting+indexing) the entire dataset for 3-Index and VP. We only report indexing time for the non-secure counterparts 3-Index-plain and VP-plain.

Data Loading. Figure 4 shows the dataset load timesFootnote 4 for the 3-Index and VP strategies. The reported time consists of the time to encrypt the triples using the aforementioned FE scheme, and the time to securely hash the terms and create the different indexes. In contrast, the non-secure triplestores, i.e. the 3-Index-plain and VP-plain counterparts, only require the dataset to be indexed (we also make use of the hash of the terms in order to compare the encryption overhead).

The results show that the time of both the 3-Index and the VP strategy scales linearly with the number of triples, which indicates that the representation can scale in the envisioned Linked Data scenario. It is worth noting that both strategies report similar performance results, where VP is slightly faster for loading given that only the subject and object is used to index each triple (the predicate is implicitly given by vertical partitioning). Finally, note that the comparison w.r.t the plain counterparts shows that the encryption overhead can be of one order of magnitude for the smaller datasets. In contrast, the encryption overhead is greatly reduced for larger datasets which is primarily due to the fact that the loading time for large datasets is the predominant factor, as the B-Tree indexes become slower the more triples are added (due to rebalancing).

Query Resolution. Figure 5 shows the query resolution time for two selected datasetsFootnote 5, LUBM with 5M triples and Jamendo, considering all types of triple patterns. To do so, we sample 1,000 queries of each type and report the average resolution time. As expected, the 3-Index reports a noticeable better performance than VP for queries with unbound predicates given that VP has to iterate though all predicate tables in this case. In turn, the 3-Index and the VP approaches remain competitive with respect to their non-secure counterparts, if a look-up returns only a small amount of results as it is usually the case for (s,?,?), (s,?,o), (s,p,o) queries. However, the more query results that need to be returned the longer the decryption takes. At this point we also want to stress that due to the nature of our approach, each result triple can be returned as soon as its decryption has finished. This is in line with the incremental nature of the Triple Pattern Fragment [31] approach, which paginates the query results (typically including 100 results per page), allowing users to ask for further pages if required. For example, decrypting Jamendo entirely took about 2256 s for VP and 2808 s for 3-Index, leading to respective triple decryption rates of 465 triples/s and 374 triples/s in a cold scenario, which already fulfils the performance requirements to feed several Triple Pattern Fragments per second.

Fig. 5.
figure 5

Cold query times of LUBM with 5M triples (LHS) and Jamendo (RHS) for 3-Index, VP, and their non-secure counterparts in ms (logarithmic y-axis).

Scalability. As mentioned in Sect. 4, our approach allows for parallel encryption/decryption of triples, thus scales with the system’s supported level of parallelisation/number of available cores (e.g., encrypting and indexing (3-Index) 10000 LUBM triples takes about 76 s with 16 available cores, 133s with 8, 262s with 4, and 497 s with 2 available cores).

Our experiments have shown that (i) the performance of our indexing strategy is not affected by the encryption, hence, is as effective on encrypted data as it is on non-encrypted data, and (ii) the decryption of individual triples is a fast process which can be utilised in our Linked Data scenario, especially under the umbrella of the Linked Data Fragments framework.

6 Conclusion

To date Linked Data publishers have mainly focused on exposing and linking open data, however there is also a need to securely store, exchange, and query also sensitive data alongside (i.e., closed data). Both access control and encryption mechanisms are needed to protect such data from unauthorised access, security breaches, and potentially untrusted service providers. Herein, we presented a mechanism to provide secure and fine-grained encryption of RDF datasets. First, we proposed a practical realisation of a functional encryption scheme, which allows data providers to generate query keys based on (triple-)patterns, whereby one decryption key can decrypt all triples that match its associated triple pattern. As such, our approach operates on a very fine level of granularity (i.e., triple level), which provides a high degree of flexibility and enables controlled access to encrypted RDF data. In existing literature, enforcing access control at the level of single statements or tuples is generally referred to as fine-grained access control (cf. [23]). Then, we presented two indexing strategies (implemented using MapDB) to enhance query performance, the main scalability bottleneck when it comes to serving user requests.

Our empirical evaluation shows that both indexing strategies on encrypted RDF data report reasonable loading and query performance overheads with respect to traditional, non-encrypted data retrieval. Our results also indicate that the approach is relatively slow for batch decryption, but this can be counteracted by the fact that it is suitable for serving incremental results, hence it is particularly suitable for Linked Data Fragments.

In future work, we plan to inspect different indexing strategies in order to optimise the loading time and query performance of large queries. We also consider extending our proposal to cater for named graphs, that is, encrypting quads instead of triples and generating keys based on quad patterns. Finally, we aim to integrate the proposed secure RDF store with a “policy” tier by employing Attribute-based Access Control (ABAC), which will manage the access/revocation to the query keys and serve as fully fledged security framework for Linked Data.