The secure data store holds all the encrypted triples, i.e. {\(c_{t_1}\), \(c_{t_2}\), \(\cdots \), \(c_{t_n}\)}, where n is the total number of triples in the dataset. Besides assuring the confidentiality of the data, the data store is responsible for enabling the querying of encrypted data.
In the most basic scenario, since triples are stored in their encrypted form, a user’s query would be resolved by iterating over all triples in the dataset, checking whether any of them can be decrypted with a given decryption key. This process is inefficient at large scale, as its cost grows linearly with the number of triples. As a first improvement, one can distribute the set of encrypted triples among different peers such that decryption can run in parallel. In spite of the inherent performance improvements, such a solution is still limited by the available number of peers and the potentially large number of encrypted triples each peer would have to process. Current efficient solutions for querying encrypted data are based on (a) using indexes to speed up the decryption process by reducing the set of potential solutions; or (b) making use of specific encryption schemes that support the execution of operations directly over encrypted data [13]. Our solution herein follows the first approach, whereas the use of alternative encryption mechanisms that operate directly on encrypted data (such as homomorphic encryption [28]) is complementary and left to future work.
In our implementation of such a secure data store, we first encrypt all triples and store them in a key-value structure, referred to as an EncTriples Index, where the keys are unique integer IDs and the values hold the encrypted triples (see Figs. 2 and 3 (right)). Note that this structure can be implemented with any traditional Map structure, as it only requires fast access to the encrypted value associated with a given ID. In the following, we describe two alternative approaches for finding the range of IDs in the EncTriples Index that can satisfy a triple pattern query: one using three individual indexes and one based on Vertical Partitioning (VP). In order to maintain simplicity and general applicability of the proposed store, both alternatives consider key-value backends, which are increasingly used to manage RDF data [8], especially in distributed scenarios. It is also worth mentioning that we focus on basic triple pattern queries as (i) they are the cornerstone that can be used to build more complex SPARQL queries, and (ii) they provide all the functionality needed to support the Triple Pattern Fragments [31] interface.
3-Index Approach. Following well-known indexing strategies, such as those of CumulusRDF [25], we use three key-value B-Trees in order to cover all triple pattern combinations: SPO, POS and OSP Indexes. Figure 2 illustrates this organisation. As can be seen, each index consists of a Map whose keys are the securely hashed (cf. PBKDF2 [19]) subject, predicate, and object of each triple, and whose values point to the IDs storing the respective ciphertext triples in the EncTriples Index.
Algorithm 1 shows the resolution of a (s,p,o) triple pattern query using the 3-Index approach. First, we compute the secure hashes h(s), h(p) and h(o) from the corresponding s, p and o provided by the user (Line 1). Our hash(s, p, o) function does not hash unbound terms in the triple pattern but treats them as a wildcard ‘?’ term (hence all terms will be retrieved in the subsequent range queries). Then, we select the best index to evaluate the query (Line 2). In our case, the SPO Index serves (s,?,?) and (s,p,?) triple patterns, the POS Index serves (?,p,?) and (?,p,o), and the OSP Index serves (s,?,o) and (?,?,o). Both (s,p,o) and (?,?,?) can be solved by any of them. Then, we use the selected index to get the range of values where the given h(s), h(p), h(o) (or ‘anything’ if the wildcard ‘?’ is present in a term) are stored (Line 3). Note that this search can be implemented by utilising B-Trees [10, 29] for indexing the keys. For each of the candidate ID values in the range (Line 4), we retrieve the encrypted triple for that ID by looking it up in the EncTriples Index (Line 5). Finally, we attempt to decrypt the encrypted triple using the key provided by the user (Line 6). If the status of the decryption is valid (Line 7), the decryption was successful and we output the decrypted triple (Line 8) as a solution to the query.
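The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated here: sorted lists searched with `bisect` stand in for the B-Trees, the salt and iteration count passed to PBKDF2 are arbitrary, `None` plays the role of the wildcard ‘?’, and the `encrypt`/`decrypt` pair is a mock (a ciphertext is simply tagged with its key) that only reproduces the valid/invalid decryption status. The names `ThreeIndexStore`, `select_index` and `ORDERS` are hypothetical.

```python
import bisect
import hashlib

SALT = b"demo-salt"  # assumption: salt shared by the indexer and the query processor


def h(term):
    """Secure hash of a term with PBKDF2 (low iteration count, for the demo only)."""
    return hashlib.pbkdf2_hmac("sha256", term.encode(), SALT, 1000).hex()


# Mock cipher (assumption): a "ciphertext" is a (key, triple) pair and decryption
# succeeds only with the matching key. It merely stands in for real encryption.
def encrypt(triple, key):
    return (key, triple)


def decrypt(ciphertext, key):
    stored_key, triple = ciphertext
    return (True, triple) if stored_key == key else (False, None)


# Positions of (s, p, o) inside the composite key of each index.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def select_index(s, p, o):
    """Pick the index whose key order has the bound terms as a prefix."""
    if s is not None and p is None and o is not None:
        return "OSP"  # (s,?,o)
    if s is not None:
        return "SPO"  # (s,?,?), (s,p,?), (s,p,o)
    if p is not None:
        return "POS"  # (?,p,?), (?,p,o)
    if o is not None:
        return "OSP"  # (?,?,o)
    return "SPO"      # (?,?,?): any index works


class ThreeIndexStore:
    def __init__(self):
        self.enc_triples = {}                      # EncTriples Index: ID -> ciphertext
        self.indexes = {name: [] for name in ORDERS}  # sorted lists emulate B-Trees

    def add(self, triple, key):
        tid = len(self.enc_triples)
        self.enc_triples[tid] = encrypt(triple, key)
        hashed = tuple(h(term) for term in triple)
        for name, order in ORDERS.items():
            entry = tuple(hashed[i] for i in order) + (tid,)
            bisect.insort(self.indexes[name], entry)

    def query(self, s, p, o, key):
        hashed = tuple(None if t is None else h(t) for t in (s, p, o))  # Line 1
        name = select_index(s, p, o)                                     # Line 2
        prefix = []
        for i in ORDERS[name]:
            if hashed[i] is None:
                break
            prefix.append(hashed[i])
        prefix = tuple(prefix)
        index = self.indexes[name]
        start = bisect.bisect_left(index, prefix)                        # Line 3
        results = []
        for entry in index[start:]:                                      # Line 4
            if entry[:len(prefix)] != prefix:
                break  # left the range of matching keys
            ok, triple = decrypt(self.enc_triples[entry[-1]], key)       # Lines 5-6
            if ok:                                                       # Line 7
                results.append(triple)                                   # Line 8
        return results


store = ThreeIndexStore()
store.add(("alice", "knows", "bob"), key="k1")
store.add(("alice", "age", "30"), key="k1")
store.add(("bob", "knows", "carol"), key="k2")
print(store.query("alice", None, None, key="k1"))
```

Because every index entry carries the triple ID as its last component, a prefix scan over the hashed terms directly yields the candidate IDs, and only those candidates are passed to decryption.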
Thus, the combination of the three SPO, POS and OSP Indexes reduces the search space of the query requests by applying simple range scans over hashed triples. This efficient retrieval has traditionally been served through tree-based map structures guaranteeing logarithmic, i.e. O(log n), costs for searches and updates on the data, hence we rely on B-Tree stores for our practical materialisation of the indexes. On the other hand, supporting all triple pattern combinations in 3-Index comes at the expense of additional space overheads, given that each (h(s),h(p),h(o)) of a triple is stored three times (once in each of the SPO, POS and OSP Indexes). Note, however, that this is a typical scenario for RDF stores and in our case the triples are encrypted and stored just once (in the EncTriples Index).
Vertical Partitioning Approach. Vertical partitioning [1] is a well-known RDF indexing technique motivated by the fact that usually only a few predicates are used to describe a dataset [14]. Thus, this technique stores one “table” per predicate, indexing (S,O) pairs that are related via the predicate. In our case, we propose to use one key-value B-Tree for each h(p), storing (h(s),h(o)) pairs as keys, and the corresponding ID as the value. Similar to the previous case, the only requirement is to allow for fast range queries on the map index keys. However, in the case of an SO index, traditional key-value schemes are not efficient for queries where the first component (the subject) is unbound. Thus, to improve efficiency for triple patterns with an unbound subject (i.e. (?,\(p_y\),\(o_z\)) and (?,?,\(o_z\))), while remaining in a general key-value scheme, we duplicate the pairs and introduce the inverse (h(o),h(s)) pairs. The final organisation is shown in Fig. 3 (left), where the predicate maps are referred to as Pred_h(p\(_1\)), Pred_h(p\(_2\)),..., Pred_h(p\(_n\)) Indexes. As depicted, we add "so" and "os" keywords to the stored composite keys in order to distinguish the order of the key.
Algorithm 2 shows the resolution of a (s,p,o) triple pattern query with the VP organisation. In this case, after performing the variable initialisation (Line 1) and the aforementioned secure hash of the terms (Line 2), we inspect the predicate term h(p) and select the corresponding predicate index (Line 3), i.e., Pred_h(p). If the predicate is unbound, however, all predicate indexes are selected, as we have to iterate through all tables, which penalises the performance of such queries. For each predicate index, we then inspect the subject term (Lines 5–9). If the subject is unbound (Line 5), we perform a ("os",h(o),?) range query over the corresponding predicate index (Line 6); otherwise we execute a ("so",h(s),h(o)) range query. Note that in both cases the object could also be unbound. The algorithm iterates over the candidate IDs (Line 10 onwards) in a similar way to the previous case, i.e., retrieving the encrypted triple from the EncTriples Index (Line 11) and performing the decryption (Lines 12–14).
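The VP resolution can be illustrated with a similar sketch. As before, this is only an approximation under stated assumptions: one sorted list per hashed predicate stands in for each Pred_h(p) B-Tree, `None` denotes an unbound term, the PBKDF2 salt and iteration count are arbitrary, and the cipher is a mock that only models whether decryption succeeds. The class name `VPStore` and its methods are hypothetical.

```python
import bisect
import hashlib

SALT = b"demo-salt"  # assumption: salt shared by the indexer and the query processor


def h(term):
    """Secure hash of a term with PBKDF2 (low iteration count, for the demo only)."""
    return hashlib.pbkdf2_hmac("sha256", term.encode(), SALT, 1000).hex()


# Mock cipher (assumption): decryption succeeds only with the matching key.
def encrypt(triple, key):
    return (key, triple)


def decrypt(ciphertext, key):
    stored_key, triple = ciphertext
    return (True, triple) if stored_key == key else (False, None)


class VPStore:
    def __init__(self):
        self.enc_triples = {}   # EncTriples Index: ID -> ciphertext
        self.pred_indexes = {}  # h(p) -> sorted list of (tag, first, second, ID)

    def add(self, triple, key):
        s, p, o = triple
        tid = len(self.enc_triples)
        self.enc_triples[tid] = encrypt(triple, key)
        index = self.pred_indexes.setdefault(h(p), [])
        # Each pair is stored twice so that either s or o can lead the key.
        bisect.insort(index, ("so", h(s), h(o), tid))
        bisect.insort(index, ("os", h(o), h(s), tid))

    def query(self, s, p, o, key):
        results = []                                                  # Line 1
        hs, hp, ho = (None if t is None else h(t) for t in (s, p, o))  # Line 2
        if hp is None:
            indexes = list(self.pred_indexes.values())  # unbound predicate: all tables
        else:
            indexes = [self.pred_indexes.get(hp, [])]                 # Line 3
        for index in indexes:
            if hs is None:
                # ("os", h(o), ?) range; the object may be unbound as well
                prefix = ("os",) if ho is None else ("os", ho)
            else:
                # ("so", h(s), h(o)) range; the object may be unbound
                prefix = ("so", hs) if ho is None else ("so", hs, ho)
            start = bisect.bisect_left(index, prefix)
            for entry in index[start:]:
                if entry[:len(prefix)] != prefix:
                    break  # left the range of matching keys
                ok, triple = decrypt(self.enc_triples[entry[-1]], key)
                if ok:
                    results.append(triple)
        return results


store = VPStore()
store.add(("alice", "knows", "bob"), key="k1")
store.add(("carol", "likes", "bob"), key="k1")
store.add(("bob", "knows", "carol"), key="k2")
print(store.query(None, None, "bob", key="k1"))
```

The last query has an unbound predicate, so the loop visits every predicate list; this is the case the text identifies as costly for VP.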
Overall, VP needs less space than the previous 3-Index approach, since the predicates are represented implicitly and the subjects and objects are represented only twice. On the other hand, it penalises queries with an unbound predicate, as it has to iterate through all tables. Nevertheless, studies on SPARQL query logs show that these queries are infrequent in real applications [3].
Protecting the Structure of Encrypted Data. The proposed hash-based indexes are a cornerstone for boosting query resolution performance by reducing the encrypted candidate triples that may satisfy the user queries. The use of secure hashes [19] assures that the terms cannot be revealed but, on the other hand, the indexes themselves reproduce the structure of the underlying graph (i.e., the in/out degree of nodes). The structure should therefore also be protected, as hash-based indexes can represent a security risk if the data server is compromised. State-of-the-art solutions (cf. [13]) propose the inclusion of spurious information that the query processor must filter out in order to obtain the final query result.
In our particular case, this technique can be adopted by adding dummy triple hashes into the indexes with a corresponding ciphertext (in the EncTriples Index) that cannot be decrypted by any key, and hence will not influence the query results. Such an approach ensures that both the triple hashes and their corresponding ciphertexts are not distinguishable from real data.
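The effect of such dummy entries can be illustrated as follows. This is a sketch under several assumptions that are not part of the proposal itself: the cipher is a toy construction (a SHA-256 keystream with an HMAC tag) used only so that random bytes fail to decrypt, and it should not be taken as secure or as the encryption scheme used here; a plain list stands in for the SPO Index; and the dummy entry reuses the hash of a real subject to inflate its apparent out-degree, which is one possible padding strategy. The names `PaddedStore`, `add_dummy` and `candidate_ids` are hypothetical.

```python
import hashlib
import hmac
import os

SALT = b"demo-salt"  # assumption: salt shared by the indexer and the query processor


def h(term):
    """Secure hash of a term with PBKDF2 (low iteration count, for the demo only)."""
    return hashlib.pbkdf2_hmac("sha256", term.encode(), SALT, 1000).hex()


def _keystream(key, nonce, length):
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(plaintext, key):
    """Toy cipher (NOT secure): nonce || XOR-masked body || HMAC tag."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def decrypt(ciphertext, key):
    """Return (True, plaintext) if the tag verifies, else (False, None)."""
    nonce, body, tag = ciphertext[:16], ciphertext[16:-32], ciphertext[-32:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False, None
    stream = _keystream(key, nonce, len(body))
    return True, bytes(a ^ b for a, b in zip(body, stream))


class PaddedStore:
    def __init__(self):
        self.enc_triples = {}  # EncTriples Index: ID -> ciphertext bytes
        self.spo_index = []    # entries (h(s), h(p), h(o), ID); a list for brevity

    def add_real(self, triple, key):
        tid = len(self.enc_triples)
        self.enc_triples[tid] = encrypt("\t".join(triple).encode(), key)
        self.spo_index.append(tuple(h(t) for t in triple) + (tid,))
        return tid

    def add_dummy(self, subject, like_id):
        """Add a spurious entry under a real subject hash; its ciphertext is random."""
        tid = len(self.enc_triples)
        # Same length as a real ciphertext, but no key yields a valid tag.
        self.enc_triples[tid] = os.urandom(len(self.enc_triples[like_id]))
        fake_p, fake_o = os.urandom(32).hex(), os.urandom(32).hex()  # look like hashes
        self.spo_index.append((h(subject), fake_p, fake_o, tid))
        return tid

    def candidate_ids(self, subject):
        hs = h(subject)
        return [entry[-1] for entry in self.spo_index if entry[0] == hs]

    def query_subject(self, subject, key):
        results = []
        for tid in self.candidate_ids(subject):
            ok, plaintext = decrypt(self.enc_triples[tid], key)
            if ok:  # dummy ciphertexts are filtered out here
                results.append(tuple(plaintext.decode().split("\t")))
        return results


store = PaddedStore()
real_id = store.add_real(("alice", "knows", "bob"), key=b"k1")
store.add_dummy("alice", like_id=real_id)
print(len(store.candidate_ids("alice")), store.query_subject("alice", key=b"k1"))
```

In this sketch the index reports two candidates for the subject, while the query result still contains only the real triple, since the dummy ciphertext fails the decryption check.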