1 Introduction

As sensitive information is increasingly stored on the cloud, privacy concerns about unauthorized data access and inference have become a critical factor for individuals and corporations considering cloud-based information retrieval services. The dilemma is that a user wants to take full advantage of cloud resources to search a large hosted dataset quickly with an index stored on the same cloud, but this user may not fully trust the cloud with data privacy. A cloud server may be honest-but-curious: it can observe the client-initiated query processing flow and reason about the client’s data, so user data leakage may occur accidentally if not intentionally. Previous work in searchable encryption (Song et al., 2000; Curtmola et al., 2006; Kamara et al., 2012; Cash et al., 2013, 2014; Kamara & Moataz, 2017; Lai et al., 2018) identifies documents that match a user query from an encrypted index with minimized information leakage, but none of these papers considers ranking of the matched results. Privacy issues have also attracted research interest in the information retrieval community, and SIGIR has held several privacy workshops on this subject (Yang & Soboroff, 2015; Yang et al., 2016).

This paper studies privacy-enhanced document matching and feature access to facilitate top-k search with ranking over an encrypted index. Privacy-aware ranking solutions (Ji et al., 2018; Shao et al., 2019) with tree ensembles and neural networks have studied how to avoid the leakage of privacy-sensitive feature information during ranking computation through ranking feature encoding and obfuscation, but they do not address how to retrieve top matched documents and gather their features safely for ranking; this paper studies that open problem. The work in Agun et al. (2018) studies additive ranking with retrieval, but a server needs to send a large amount of data back to a client to complete ranking, and the work in Xu et al. (2021) further pushes for client-side search by securely fetching index data from a server. Neither study supports full server-side ranking.

Document retrieval for search often uses an inverted index in which each searchable term is associated with a list of posting records containing a document ID and its term-based features. For privacy protection, the document IDs and ranking features of an index are encoded or encrypted when hosted on a cloud server. The challenge in designing such a search index is that a server adversary can inspect the index, observe the query processing flow, and learn sensitive patterns, which may recover document plaintext at least partially. Thus a document retrieval solution needs to minimize the chance that a server processes unauthorized queries or learns sensitive information from query processing. Previous studies on privacy-abuse attacks (Islam et al., 2012; Cash et al., 2015; Wang et al., 2018; Pouliot & Wright, 2016) show that leaking information about term co-occurrence in a document can enable a plaintext attack. For example, suppose an adversary knows that the word “new” appears in a document and can estimate its co-occurrence probability with another word x in the same index. This adversary may then be able to recover x as “york” appearing in this document by comparing against known co-occurrence probabilities of the pair “new” and “york” and of other word pairs. Notice that the co-occurrence probability of two terms is proportional to the number of documents shared between the posting lists of these two terms. Thus avoiding leakage of document sharing among posting lists is our key consideration in privacy protection.

When designing an index for privacy-aware search that retrieves and ranks documents with their features, one challenge is how to associate term-specific features distributed in different online data structures with the document that owns them, without leaking document sharing patterns among posting lists. If a document ID is encrypted with a fixed static value, this value does not change from one query to another. A server adversary can then observe the document sharing pattern between two posting lists based on static document IDs, and can estimate the co-occurrence probability of these two terms appearing in the same document in this index. Following the discussion in the previous paragraph, this creates the risk of a privacy attack that successfully recovers some plaintext. On the other hand, if the encrypted value of the same document ID is non-deterministic across posting lists of different terms, the server is unable to recognize term-based features belonging to the same matched document and conduct effective ranking. The existing searchable encryption studies use non-deterministic document IDs and cannot be easily extended to a search system that gathers ranking features and computes preliminary scores during document retrieval. Another challenge is that the cryptographic operations used in searchable encryption techniques involve integers with very long bit lengths, which is expensive and becomes a performance bottleneck for a reasonable query response time even on a modestly large dataset.

The contribution of this paper is to address this open problem of privacy-aware document retrieval with ranking by introducing a two-level inverted index and query-specific tagging, called QDT. QDT gathers encoded ranking features of matched documents with privacy protection to facilitate document ranking. During index traversal, QDT generates query-specific tags to correctly recognize the term-specific ranking features of matched documents, and uses a two-level index structure to strike a trade-off between privacy and efficiency for a reasonable query response time. The design of QDT minimizes the chance that a server conducts unauthorized search with ranking or learns document sharing patterns of posting lists when processing a sequence of client-authorized queries. Our evaluation indicates that adding privacy protection for document retrieval does incur significant cost compared to well-known retrieval baselines without privacy constraints, and the results demonstrate the importance of our two-level design in optimizing the use of cryptographic computation under privacy requirements. Another contribution of this paper is to show that a leakage-abuse attack is still possible in a naive version of the above design, and that the guided fake document padding in QDT thwarts such an attack. Our evaluation validates its compatibility with ranking and its effectiveness in attack prevention.

The rest of this paper is organized as follows. Section 2 provides background on the problem addressed in this paper and the related work. Section 3 discusses the design considerations and the proposed two-level index data structure, query-specific document tagging, and guided fake document padding. Section 4 evaluates the proposed QDT scheme in terms of time efficiency, its compatibility with ranking, and its effectiveness in attack prevention. Appendix 1 discusses a design consideration of QDT in preventing an attack based on document-pair ID mapping. Appendix 2 provides several privacy properties of QDT. Appendix 3 lists the leakage profile of QDT following an analytic framework from previous work.

2 Background and related work

Search systems for text documents often employ multi-stage ranking in practice (e.g.  (Matveeva et al., 2006; Wang et al., 2011)). The first retrieval stage extracts top candidate documents matching a query from a large search index with a fast and simple ranking method. The second or a later stage uses a more complex machine learning algorithm to re-rank the top results thoroughly. Following the disjunctive query semantics widely used in previous work (e.g.  (Broder et al., 2003; Ding & Suel, 2011)), this paper focuses on the problem of document retrieval that identifies documents matching at least one search term and aggregates their term features for ranking in privacy-aware cloud search. The extra privacy protection constraint requires minimizing the information leaked to a cloud server; the threat model is discussed below. This paper assumes that document search returns a list of top-ranked document IDs for a client to use in further data fetching.

Traditional retrieval algorithms in IR often use an inverted index (Baeza-Yates & Ribeiro-Neto, 2011) which contains a set of terms and each term points to a list of document IDs that possess a feature for such a term. This list is called a posting list. Each record in this list containing a document ID and its term-specific feature is called a posting record.

Privacy and threat model There are three entities in a cloud system: a dataset owner, a search user, and a cloud server. We assume a client owns a collection of private documents and outsources the encrypted data to a cloud server. The client builds an encrypted searchable index, lets a server host this index, and only this client or other users authorized by this client can search the hosted data. We assume that a client can periodically overwrite the index on the cloud to include new content; dynamic index update (Kamara et al., 2012) is not considered in this paper. The server is allowed to access the hosted index for client query processing only, and it is honest-but-curious, i.e., the server will honestly follow the client’s protocol but will also try to learn information from what it can observe in the index and during query processing. Multi-round communication between the server and client (e.g.  (Naveed et al., 2014; Hu et al., 2011; Lai et al., 2018)), client–server collaborative ranking (Agun et al., 2018), and securely fetching index data from a server to a client for client-side search (Xu et al., 2021) are not considered in this paper because they incur high client–server communication cost for large datasets or low-bandwidth platforms, and they represent an orthogonal approach.

For privacy protection, document IDs and index information in a search system are encrypted or encoded. The biggest threat to search privacy is the leakage of document/term-related statistical information induced from an index and from access patterns within a query or across queries. From this information, a server can launch a privacy attack and partially reveal document content (Islam et al., 2012; Cash et al., 2015; Wang et al., 2018). Specifically, Islam et al. (2012) proposed a query recovery attack to identify the plaintext of some words, assuming that a server adversary knows the plaintext of a subset of words and the co-occurrence probability of two words in a document. Through intensive guessing and validation, an adversary may estimate the co-occurrence probability of words in a hosted index and derive a mapping from word IDs to English words that closely matches the known co-occurrence probabilities, which reveals the plaintext of some words as a privacy breach. Cash et al. (2015) improved the above work by exploiting extra information such as the length of the posting list of each searchable term. There are also attacks exploiting leaked document similarities (Wang et al., 2018).

The above attacks only work if the adversary knows the co-occurrence probability of targeted keywords in the document set, which is proportional to the count of documents shared among the posting lists of these keywords. By preventing the leakage of document sharing information among posting lists in a hosted index, threats from these attacks can be greatly alleviated.

Blackstone et al. (2019) introduced attacks based on volume information (document length). This paper assumes the server only needs to return the encrypted document IDs of top results, not the text of each document; a client may fetch document text from another place later using these IDs. Thus, these volume attacks are not applicable in our case.

Document retrieval with privacy-aware ranking Previous work on document retrieval (e.g.  (Broder et al., 2003; Ding & Suel, 2011)) often selects the top k results in the first stage of search based on a simple additive formula that computes the rank score of each document d as \(RankScore(d)= \sum _{t \in Q} w_t\), where Q is the set of all search terms and \(w_t\) is the term-specific weight of document d. A widely used example of such an additive formula is BM25 (Jones et al., 2000). This paper assumes disjunctive query semantics, motivated by the above formula, which implies that any document containing a query term accumulates some score and is a candidate for the top results. Notice that relevant text documents may not contain any query term in many applications, and recent advances in document term expansion through deep neural models address document-query vocabulary mismatch (Formal et al., 2021; Mallia et al., 2021; Lin & Ma, 2021). Thus disjunctive semantics can still be applied widely by coupling with such techniques.
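To make this term-at-a-time accumulation under disjunctive semantics concrete, below is a minimal plaintext sketch of ours (no privacy protection; the posting layout and top-k selection are simplified assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One posting record in a plaintext index: a document ID and its
// term-specific weight w_t (e.g., a BM25 component).
struct Posting { uint32_t doc_id; float weight; };

// Disjunctive retrieval: any document containing at least one query term
// accumulates RankScore(d) = sum of w_t over its matched terms.
std::vector<std::pair<uint32_t, float>> RankDisjunctive(
    const std::unordered_map<std::string, std::vector<Posting>>& index,
    const std::vector<std::string>& query, size_t k) {
  std::unordered_map<uint32_t, float> score;  // term-at-a-time accumulator
  for (const auto& term : query) {
    auto it = index.find(term);
    if (it == index.end()) continue;          // term absent from the index
    for (const Posting& p : it->second) score[p.doc_id] += p.weight;
  }
  std::vector<std::pair<uint32_t, float>> top(score.begin(), score.end());
  size_t cut = std::min(k, top.size());
  std::partial_sort(
      top.begin(), top.begin() + cut, top.end(),
      [](const auto& a, const auto& b) { return a.second > b.second; });
  top.resize(cut);
  return top;
}
```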

To conduct additive scoring like BM25, a server needs to decrypt the encrypted features before computation, which results in the leakage of sensitive ranking features. Homomorphic encryption (Gentry, 2009; Paillier, 1999) can let the server perform arithmetic calculations without decrypting the underlying data, but such a scheme cannot compare two scores computed under homomorphic encryption at the server side, and it is also very time consuming and computationally non-scalable when many numbers are involved in ranking computation. Dense matrix multiplication is used in  Cao et al. (2014), Sun et al. (2014), Xia et al. (2016) for additive ranking, and its complexity is proportional to the number of documents multiplied by the number of distinct words, which is not scalable for a large data collection. The evaluation of this paper adopts a simple obfuscation-based approach to improve privacy protection by mapping each BM25 term feature into a set of non-uniform partitions to hide privacy-sensitive true values. This follows the value partitioning technique (Hacigümüş et al., 2002) as extreme downsampling (Ryoo et al., 2017) to mask privacy-sensitive feature details with obfuscation.
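As a rough illustration of this value partitioning, the sketch below maps a true BM25 value to the index of the non-uniform partition containing it; the boundary values here are hypothetical, while our evaluation fits 50 partitions to the observed feature distribution:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical non-uniform partition boundaries, chosen offline by the
// client; a real deployment would fit them to the feature distribution.
const std::vector<float> kBoundaries = {0.5f, 1.2f, 2.1f, 3.5f, 5.0f, 8.0f};

// Replace a true BM25 value with its partition index (extreme downsampling):
// the server can still add and compare coarse values, but true scores stay hidden.
uint8_t ObfuscateFeature(float bm25) {
  auto it = std::upper_bound(kBoundaries.begin(), kBoundaries.end(), bm25);
  return static_cast<uint8_t>(it - kBoundaries.begin());
}
```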

Our scheme can also be used for more complex ranking. For example, one may opt to directly adopt a tree ensemble or neural ranking method without going through first-stage BM25 for a relatively small or medium-sized document collection for which such a method can be efficient enough. Privacy-aware tree ensemble ranking with feature encoding is studied by  Ji et al. (2018), and neural ranking with feature obfuscation is proposed by  Shao et al. (2019) based on KNRM and ConvKNRM ranking (Xiong et al., 2017; Dai et al., 2018). None of the above work addresses how to safely gather ranking features during document retrieval. The techniques developed in this paper to gather ranking features in a privacy-aware manner are applicable when such a ranking method is used. Transformer-based ranking models with contextual embeddings outperform KNRM and ConvKNRM on relevance (e.g.  (Lin et al., 2020; MacAvaney et al., 2020; Khattab & Zaharia, 2020; Yang et al., 2022; Li et al., 2023)). This paper does not use transformer-based models in the evaluation because there are no studies on how to address privacy issues for such complex ranking models. Since the work in Shao et al. (2019) provides privacy-aware neural re-ranking, even though it is based on static document embeddings, this paper leverages  Shao et al. (2019) to demonstrate safe top result retrieval and feature aggregation, and conducts end-to-end neural ranking under privacy constraints. Recent optimization studies on learned sparse representation of documents using BERT have shown strong relevance and efficiency results (Formal et al., 2022; Qiao et al., 2023; Thakur et al., 2023), and these techniques are orthogonal optimizations that our work can leverage in the future.

Searchable encryption, query authorization, and document identification The previous work on searchable encryption has not addressed ranking issues, including feature gathering. However, searchable encryption techniques can be leveraged to help solve our problem with disjunctive query search semantics.

Although OXT (Cash et al., 2013, 2014) does not support disjunctive query semantics, it provides a searchable encryption mechanism in which a server can extract relevant information from an index and process a query only with authorization from a client. A client sends a start-up term token and additional intersection tokens as a means of authorizing conjunctive search on the encoded index hosted on a server. Lai et al. developed the HXT scheme (Lai et al., 2018) that improves the security of OXT using hidden vector encryption and a Bloom filter, with extra overhead in multi-round client–server communication. The work by  Agun et al. (2018) extends OXT for additive ranking, but it only supports partial server-side ranking and requires large client–server communication, and thus it does not scale well for a large dataset. All of the above works require one of the terms to lead the flow of conjunction handling, and cannot be extended easily and efficiently to support disjunctive semantics.

In a traditional inverted index, it is natural and straightforward to store a feature in a posting record of a term and use a document ID to show the ownership of this feature. This natural way of associating a term feature with a deterministic and static document ID does not work for a privacy-sensitive index, because a server can easily count the number of static document IDs shared between the posting lists of two terms to compute the co-occurrence probability of the targeted words, defined as the number of documents that appear in both posting lists divided by the total number of documents. This can enable the plaintext attack mentioned above, as the sketch below illustrates.
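The following is a minimal sketch of the counting an honest-but-curious server could perform over two posting lists of static (deterministically encrypted) document IDs; the list layout is a simplified assumption:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// With static document IDs, a curious server can count the IDs shared by
// two posting lists and estimate the co-occurrence probability of the two
// corresponding terms, which feeds the plaintext attacks described above.
double EstimateCooccurrence(const std::vector<uint64_t>& list_a,
                            const std::vector<uint64_t>& list_b,
                            size_t n_docs) {
  std::unordered_set<uint64_t> seen(list_a.begin(), list_a.end());
  size_t shared = 0;
  for (uint64_t id : list_b) shared += seen.count(id);
  return static_cast<double>(shared) / static_cast<double>(n_docs);
}
```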

The IEX scheme introduced by  Kamara and Moataz (2017) supports disjunctive queries; the same document ID appearing in different posting records has a different encrypted value under non-deterministic encryption, so document sharing patterns among posting lists are not leaked explicitly. Because of this non-deterministic encryption, however, the server cannot associate features of the same document appearing in different posting records, and thus it is difficult to extend IEX to support ranking feature gathering.

Static index vs. dynamic update Recent studies on searchable encryption for conjunctive queries have proposed ODXT (Patranabis & Mukhopadhyay, 2021) and ESP-CKS (Xu et al., 2023) for dynamic updates of a database and its index, which is their main advancement compared to OXT and HXT, but server-side ranking is not addressed. Our work, on the other hand, does not support dynamic document addition and deletion in an existing search index, but focuses on ranking with a static index. We assume the search index on a hosted server can be refreshed periodically with a new or updated index. Thus the above dynamic index update work is orthogonal to this paper.

Fake document padding, obfuscation, and differential privacy The leakage of posting list lengths can further aid text-revealing privacy attacks together with term co-occurrence leakage, as shown in  Islam et al. (2012), Cash et al. (2015), Pouliot and Wright (2016). To avoid this, padding fake documents into a posting list is proposed in Islam et al. (2012), Patel et al. (2019), Kamara and Moataz (2019) for searchable encryption. A bucket-based padding technique is proposed in  Kamara and Moataz (2019), but it is not extensible for ranking because it cannot identify the same documents referenced in different posting records. Also, storing ranking features in this scheme requires excessive feature duplication, making the space cost very expensive.

Padding with fake documents in the index can be viewed as an obfuscation strategy. Obfuscating search queries has been studied to achieve user anonymity (Ahmad et al., 2018, 2016). There is a line of work on data perturbation or obfuscation for differential privacy in classification (e.g.  (Jagannathan et al., 2009; Liu et al., 2017)) and in neural model or representation learning (Chase et al., 2017; Habernal, 2022).  Chen et al. (2018) introduced a differentially private obfuscation framework to mitigate access-pattern leakage in searchable encryption; it intentionally includes false positives and false negatives in the index while adding redundancy by encoding a document into multiple shards based on erasure coding.  Shang et al. (2021) propose obfuscated searchable encryption that improves Chen et al.'s scheme with a fresh obfuscation per query, at the cost of a larger search time: its query processing has a time complexity of \(O(\frac{n\log n}{\log \log n})\), where n is the number of documents in the index, which is not scalable for a large dataset. Neither of these two papers addresses document ranking, and they deal with single-word queries only, while this paper addresses document retrieval in a more complex disjunctive multi-keyword query setting.

ORAM and hardware enclaves  Garg et al. (2015) proposed searchable encryption based on oblivious RAM (ORAM) to avoid access-pattern leakage at the cost of high search overhead for a large dataset. Document retrieval leveraging hardware technologies such as Intel SGX and/or ORAM is studied in Sun et al. (2018), Mishra et al. (2018), Hoang et al. (2019), Shao et al. (2020), Vo et al. (2021). ORAM techniques are extremely expensive without such hardware (Mishra et al., 2018). On the other hand, the risk of privacy-sensitive attacks exists on such a platform (Costan & Devadas, 2016; Brasser et al., 2017; Xu et al., 2015), and a client may not fully trust such hardware owned by servers. This paper does not use such hardware.

3 Indexing and query processing

To allow a server to gather meaningful feature information from the hosted index in a privacy-aware manner with client authorization, we propose a retrieval scheme with query-specific document tagging, called QDT. QDT leverages a cryptographic technique for information blinding used in OXT (Cash et al., 2013, 2014), but revises it for disjunctive query processing. Long-bit arithmetic is required for the cryptographic operations involved, which is expensive and significantly slows down query processing. With this in mind, QDT adopts a two-level index structure using a two-dimensional representation of document IDs, and derives a query-specific document tag to identify a document that appears in multiple accessed posting lists. This two-level structure helps QDT control the time complexity of runtime tag computation. We further develop guided padding of fake documents to better hide statistical term co-occurrence information in the index.

Like OXT, the formulas of QDT employ pseudo-random functions (PRFs), denoted as \(H_i(x)\), which are either AES-256 (Dworkin, 2001) or a cryptographic hash function such as SHA-3 (Dworkin, 2015). Each uses a secret key \(k_i\) known only to the client. A PRF is applied to search terms, document IDs, and other index information for privacy protection. The formulas in the rest of this section involve modular arithmetic of integers and use a Diffie-Hellman group (Boneh & Shoup, 2015) of size q defined as the integer set \(\{1, g, g^2 \mod Q, \cdots , g^{q-1} \mod Q \}\), where g is called a generator, and both q and Q are large primes satisfying \(g^q \mod Q \equiv 1\) and \(q < Q\). The security assumption of the Diffie-Hellman group is that given \(g^x \mod Q\), it is computationally hard to infer any useful information about x, even when g, q, and Q are public. More details on this subject and typical parameter values can be found in Section 10.4.2 of an online book (Boneh & Shoup, 2023), in Appendix 4 of an NIST standard (Barker et al., 2018), and in related RFCs (Kojo & Kivinen, 2003; Gillmor, 2016). Note that while q and Q typically need 2048 bits or more, a Diffie-Hellman group can also be built and represented using elliptic curve arithmetic with a pair of two numbers of relatively fewer bits, such as 224 bits (Boneh & Shparlinski, 2001), to speed up calculation.

Table 1 lists notations frequently used throughout this paper; it does not include some symbols that are used only locally.

Table 1 Frequently used notations

3.1 Bucketed posting lists

Our scheme divides all private documents into groups and a document ID d is represented as \(d=(gid, mid)\) where gid is its group ID and mid is a member ID in such a group. By using group IDs, a posting list is decomposed into a set of buckets, and each bucket only contains documents of the same group ID. Thus the inverted index is structured in two levels: a term points to buckets of documents, and each bucket captures a subset of a document group. The inverted index of QDT does not reveal group IDs directly. Instead, the p-th bucket with group ID gid in a posting list of term w is marked by a bucket tag defined as

$$\begin{aligned} B(gid,w,p) = H_1(gid)\cdot (H_2(w|| (p \bmod P)))^{-1} \mod q \end{aligned}$$
(1)

where bucket position p is an integer counted from 1. Pseudo-random functions \(H_1\) and \(H_2\) use secret keys \(k_1\) and \(k_2\), respectively. Expression \((H_2(x))^{-1}\) denotes the modular multiplicative inverse of integer \(H_2(x)\), namely, \(H_2(x) \cdot (H_2(x))^{-1} \equiv 1 \mod q\). Symbol \(\cdot\) denotes modular multiplication. We adopt this inverse operator, inspired by OXT (Cash et al., 2013, 2014), for blinding, so that a server cannot learn blinded information from the index without client authorization: the above expression for each bucket tag is computed during index generation, before the index is outsourced to a server, so the server only knows the final value of a bucket tag and does not know its integer factors, including \(H_1(gid)\). As discussed in Sect. 3.2, a deblinding token is sent from a client as a piece of query-specific authorization information to remove this blinding factor during query processing.

Notice that in the above expression, the modulo operator is applied to p as we adopt a modulo positioning technique (Lipmaa et al., 2000; Cash et al., 2013) which allows the cyclic re-use of the deblinding tokens from a client during query processing. Its impact on complexity control will be further discussed in Sect. 3.3. Symbol || denotes concatenation between two integers in binary representation, e.g., \(11_b||01_b = 1101_b\).
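The following sketch of ours illustrates Expression (1) with toy parameters; the small prime and placeholder PRFs are illustrative stand-ins, not the secure primitives (keyed AES-256/SHA-3 over 2048-bit or elliptic-curve groups) an actual deployment would use:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Toy modulus for illustration only: q = 1019 is prime; a real deployment
// uses 2048-bit primes or 224/256-bit elliptic-curve groups.
const uint64_t q = 1019;

// Placeholder PRFs mapping into [1, q-1]: stand-ins for keyed AES-256/SHA-3,
// NOT cryptographically secure.
uint64_t H1(uint64_t gid) { return std::hash<uint64_t>{}(gid) % (q - 1) + 1; }
uint64_t H2(const std::string& s) { return std::hash<std::string>{}(s) % (q - 1) + 1; }

// Square-and-multiply modular exponentiation.
uint64_t ModPow(uint64_t b, uint64_t e, uint64_t m) {
  uint64_t r = 1; b %= m;
  for (; e > 0; e >>= 1) { if (e & 1) r = r * b % m; b = b * b % m; }
  return r;
}
// Modular inverse via Fermat's little theorem: x^{-1} = x^{q-2} mod q (q prime).
uint64_t ModInv(uint64_t x) { return ModPow(x, q - 2, q); }

// Bucket tag of Expression (1): H_1(gid) blinded by the inverse of
// H_2(w || (p mod P)), so the server cannot recover H_1(gid) on its own.
uint64_t BucketTag(uint64_t gid, const std::string& w, uint64_t p, uint64_t P) {
  return H1(gid) * ModInv(H2(w + "|" + std::to_string(p % P))) % q;
}
```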

Each bucket contains a number of posting records, each representing a document with this group ID that contains the targeted term w. The key to each posting list is encoded by a PRF \(H_0(w)\) using secret key \(k_0\). Each posting record for a document \(d=(gid,mid)\) is a tuple \((E(d), H_3(mid), F(d, w))\) where

  • E(d) is a symmetric encryption result of document ID d, e.g., AES-256 with CBC mode (Dworkin, 2001). A document d may appear in different posting records and we use different seeds to compute the corresponding E(d) values. Notice that this field E(d) is needed only for a unigram term. The posting records that refer to the same document will be recognized with a query-specific document tag, and only one of these records needs to host the encrypted document ID, and this will be discussed in Sect. 3.2.

  • \(H_3(mid)\) is the hashed member ID using a PRF function with secret key \(k_3\). Note that tuple \((B(gid, w,p), H_3(mid))\) uniquely identifies a document within the posting list of a term, but this bucket tag paired with mid is not sufficient to identify a document that appears in different posting lists, because the same document can have a different bucket tag in different posting lists.

  • \(F(d, w)\) is the encrypted or encoded element-based or vector-based feature of document d under feature key w. Ranking feature encoding and obfuscation are addressed in  Ji et al. (2018), Shao et al. (2019) for tree-based and neural ranking. Section 4 discusses the feature value partitioning based obfuscation (Hacigümüş et al., 2002) used in our evaluation.
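Putting the three fields together, one plausible in-memory layout is sketched below; the field widths follow the storage cost discussion in Sect. 4, and all names are our own simplification:

```cpp
#include <cstdint>
#include <vector>

// One posting record (E(d), H_3(mid), F(d,w)); field widths follow the
// storage cost breakdown in Sect. 4.
struct PostingRecord {
  uint8_t  enc_doc_id[32];  // E(d): 16-byte IV + one AES-256-CBC block (unigrams only)
  uint16_t hashed_mid;      // H_3(mid): hashed member ID
  uint32_t feature;         // F(d,w): encoded/obfuscated ranking feature
};

// A bucket groups the records of one (hidden) document group; its tag
// B(gid,w,p) stays opaque to the server until a client sends deblinding tokens.
struct Bucket {
  uint8_t bucket_tag[28];   // B(gid,w,p) stored as an opaque long-bit value
  std::vector<PostingRecord> records;
};

// The posting list keyed by H_0(w) is a sequence of such buckets.
using PostingList = std::vector<Bucket>;
```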

Fig. 1 An example of two-level inverted index with 3 group buckets

Figure 1 illustrates an example of the inverted index where the original posting list of term w has four real documents \(d_1\), \(d_2\), \(d_5\), and \(d_6\). Under QDT, the corresponding two-level index is organized with 3 group buckets. These buckets contain the above 4 real documents and also 4 padded fake documents marked with symbol \(\bot\); fake document padding will be explained in Sect. 3.4. The first bucket contains 3 posting records, including two real documents \(d_1\) (with member ID \(mid_1\)) and \(d_2\) (with member ID \(mid_2\)) containing term w in group \(g_a\). Similarly, the second bucket contains 3 posting records while the third bucket contains 2 posting records. Modulo positioning uses \(P=2\), and thus the third bucket has bucket tag \(B(g_c,w,3)=B(g_c,w,1)\).

3.2 Online query processing with QDT

Figure 2 shows the flow of query processing with QDT. Given a query with m terms \(H_0(w_1), \cdots , H_0(w_m)\), a client samples a random integer R and computes a deblinding token at the p-th bucket position for each search term w as:

$$\begin{aligned} \textsf {dtoken}(w,p) = g^{R \cdot H_2(w|| (p\bmod P))} \mod Q. \end{aligned}$$
(2)

The number P in the above expression limits the maximum number of tokens needed. The client sends all m terms \(H_0(w_1), \cdots , H_0(w_m)\) and all \(m\cdot P\) authorization deblinding tokens \(\textsf {dtoken}(w_i,p)\), for \(1 \le i \le m\) and \(0 \le p < P\), to the server.

After the client sends the above encoded terms and tokens to a server, this server linearly visits the bucketed posting lists of the searched terms one by one in the hosted inverted index. It computes a query-specific group tag at bucket position p for term w as

$$\begin{aligned} \textsf {gtag}= \textsf {dtoken}(w,p)^{B(gid,w,p)} \mod Q, \end{aligned}$$
(3)

which is considered as a group tag for all documents in this posting bucket. Expanding the value of \(\textsf {dtoken}(w,p)^{B(gid,w,p)}\) using Expression (1), we have:

$$\begin{aligned} \textsf {gtag}&= g^{R \cdot H_2(w|| (p\bmod P)) \cdot H_1(gid)\cdot (H_2(w|| (p \bmod P)))^{-1} \bmod q} \bmod Q \\ &= g^{R \cdot H_1(gid)} \bmod Q, \end{aligned}$$
(4)

which depends only on the group ID gid for the given query, since the value R is the same across the search terms of this query. The group tag is query-specific, since R is chosen randomly for each query. Appendix 1 further discusses how this design prevents an attack.
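The following self-contained sketch verifies this cancellation end to end with toy parameters (a 10-bit safe-prime group and placeholder PRFs; an actual deployment uses 2048-bit or elliptic-curve groups and keyed AES-256/SHA-3):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Toy parameters for illustration only: Q = 2q + 1 with q = 1019 prime, and
// g = 4 generates the order-q subgroup of quadratic residues mod Q.
const uint64_t q = 1019, Q = 2039, g = 4;

// Placeholder PRFs mapping into [1, q-1] (stand-ins for keyed PRFs; not secure).
uint64_t H1(uint64_t gid) { return std::hash<uint64_t>{}(gid) % (q - 1) + 1; }
uint64_t H2(const std::string& s) { return std::hash<std::string>{}(s) % (q - 1) + 1; }

uint64_t ModPow(uint64_t b, uint64_t e, uint64_t m) {
  uint64_t r = 1; b %= m;
  for (; e > 0; e >>= 1) { if (e & 1) r = r * b % m; b = b * b % m; }
  return r;
}
uint64_t ModInv(uint64_t x) { return ModPow(x, q - 2, q); }  // q is prime

int main() {
  const uint64_t P = 128, gid = 42, p = 3, R = 777;  // R: per-query random
  std::string key = "term_w|" + std::to_string(p % P);

  // Index time (client): bucket tag B(gid,w,p), Expression (1).
  uint64_t B = H1(gid) * ModInv(H2(key)) % q;
  // Query time (client): deblinding token, Expression (2).
  uint64_t dtoken = ModPow(g, R * H2(key) % q, Q);
  // Query time (server): group tag, Expression (3).
  uint64_t gtag = ModPow(dtoken, B, Q);

  // Expression (4): the H_2 factors cancel, leaving g^{R*H1(gid)} mod Q,
  // the same value for every term of this query that hits group gid.
  assert(gtag == ModPow(g, R * H1(gid) % q, Q));
  std::printf("gtag = %llu\n", (unsigned long long)gtag);
  return 0;
}
```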

Within a bucket, for each posting record, the server fetches its member ID information. The pair \((\textsf {gtag}, H_3(mid))\) acts as a query-specific document tag and uniquely identifies a matched document that owns the feature stored in this record. Using this query-specific document tag, the runtime system builds a query-specific in-memory key-value hash table with this document tag as a key. The value of this key represents the corresponding set of features.

As the above index traversal linearly visits all posting records of multiple search terms, the features for the same document that appears in multiple posting records are added to the hashtable gradually with the same query-specific document tag. Then the ranking features of this document can be recognized and aggregated.
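A minimal sketch of this server-side accumulator follows; the tag width and feature encoding are simplified assumptions of ours:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Query-specific document tag (gtag, H_3(mid)); gtag changes per query,
// so this table reveals no cross-query document sharing pattern.
struct DocTag {
  uint64_t gtag;        // simplified: real tags are long-bit group elements
  uint16_t hashed_mid;
  bool operator==(const DocTag& o) const {
    return gtag == o.gtag && hashed_mid == o.hashed_mid;
  }
};
struct DocTagHash {
  size_t operator()(const DocTag& t) const {
    return t.gtag * 0x9e3779b97f4a7c15ULL ^ t.hashed_mid;
  }
};

// Features of the same matched document accumulate under the same tag as
// the traversal visits posting records; they can then be scored and ranked.
using FeatureTable = std::unordered_map<DocTag, std::vector<uint32_t>, DocTagHash>;

void AddRecord(FeatureTable& table, uint64_t gtag, uint16_t hashed_mid,
               uint32_t encoded_feature) {
  table[{gtag, hashed_mid}].push_back(encoded_feature);
}
```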

The outcome of the above traversal is a set of matched documents and the features of each document accumulated in the above hash table. This outcome can be incrementally injected as input to a privacy-aware ranking scheme. As discussed in Sect. 2, our evaluation uses a simple obfuscation-based additive ranking which hides true BM25 values. Alternatively, one can opt to apply privacy-aware tree ensemble or neural ranking (Ji et al., 2018; Shao et al., 2019). Finally, the server sends the ranked and encrypted document IDs to the client as shown in Fig. 2. The client decrypts them and may filter out some fake IDs, as discussed in Sect. 3.4.

To optimize time efficiency, the previous retrieval work with additive ranking has developed index skipping methods called WAND (Broder et al., 2003), BMW (Ding & Suel, 2011) and its variants (e.g.  (Mallia et al., 2017; Shao et al., 2021)). These optimization methods avoid the processing of low-scoring documents below a top-k threshold to reduce the retrieval latency. We do not use such an optimization method because it requires that each posting list is presorted by static and deterministic document IDs to guide skipping. In our problem context with privacy protection requirement, document IDs in a search index are encrypted non-deterministically and the same document has different encrypted ID values in the different lists. Similarly, other optimization on list intersection (Culpepper & Moffat, 2010) that requires document pre-sorting cannot be leveraged. This represents an efficiency tradeoff for privacy.

Fig. 2 The flow of privacy-aware search with QDT

3.3 Time cost of query processing with QDT

The two-level index design does not affect the effectiveness of document retrieval in terms of ranking quality; its main motivation is to control time complexity. Assume that the client–server communication cost of transmitting encoded query terms, tokens, and the top k results (with a relatively small k) is insignificant. We also assume the aforementioned obfuscated BM25 ranking, which is relatively fast, so its cost is less significant than the other items listed below. Then the time complexity of query processing with QDT is dominated by the following expression:

$$\begin{aligned} m \cdot P \cdot T_{token} + m\cdot [T_{kvstore}+ C \cdot T_{gtag} + L \cdot T_{hashtb}], \end{aligned}$$

where

  • m is the number of search terms derived from a given query;

  • P is the modulus used in modulo positioning;

  • \(T_{token}\) is the client-side time to generate one deblinding token;

  • \(T_{kvstore}\) is the average server-side time to access a posting list from a key-value store hosted on a disk drive;

  • \(T_{gtag}\) is the server-side time to compute the group tag of a posting bucket;

  • C is the average number of posting buckets in a posting list;

  • L is the average total number of documents in the posting list per term;

  • \(T_{hashtb}\) is the server-side time to perform a hash table lookup and/or insertion using a query specific document tag, and then combine the features of the corresponding document in this hash table.

Notice that cost parameter \(T_{token}\) is much larger than \(T_{kvstore}\), and L is fixed by the dataset. As shown below, controlling parameters C and P is critical for a reasonable response time.

Cost parameter \(T_{gtag}\) is for computing a group tag with Expression (3) and involves integer exponentiation in a Diffie-Hellman group. The integers involved need to be 2048 bits or longer to be secure, as a common cryptographic practice (Boneh & Shoup, 2015), and such exponentiation is extremely expensive. We use an elliptic curve based optimization to re-formulate the exponentiation computation, represent each long-bit number with a pair of two numbers, and reduce the number of bits needed to 224 (Boneh & Shparlinski, 2001) instead of 2048 to speed up calculation; it is still time-consuming. In our evaluation, \(T_{gtag} \approx 0.13\,ms\) while \(T_{hashtb}=0.001\,ms\) using 256-bit elliptic curve-based modular exponentiation. With a standard one-level approach where \(C=L\), the cost of tag computation would dominate: for 10,000 buckets and 5 search terms, group tag computation alone would take around 6.5 s per query, which is too slow for an interactive response. Our two-level design allows \(C \ll L\), which makes the tag computing time more affordable.
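The 6.5 s figure follows directly from the cost model:

$$m \cdot C \cdot T_{gtag} = 5 \times 10{,}000 \times 0.13\,\text {ms} = 6.5\,\text {s}.$$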

It is expensive for a client to compute deblinding tokens because token computation involves modular exponentiation; in our evaluation, \(T_{token} \approx 0.13\,ms\). The use of modulo positioning (Lipmaa et al., 2000; Cash et al., 2013) reduces the client token generation time because \(\textsf {dtoken}(w,p) = \textsf {dtoken}(w,p \bmod P)\), so parameter P limits the number of deblinding tokens computed at a client. Without this limit, \(P=L\) and client-side token computation would become a bottleneck.

With the above design considerations of \(C \ll L\) and a limited P value, the dominating part of the query processing time of QDT is proportional to \(m\cdot L\). Under a typical term distribution in an inverted index, \(L \ll n\), where n is the total number of documents in the dataset. Thus QDT can take advantage of an inverted index data structure for time efficiency like a traditional retrieval algorithm, while there is still a performance gap due to the cryptographic operations used, which will be evaluated in Sect. 4.

A comparison with the previous work in design considerations While QDT adopts inverse-based blinding, following OXT (Cash et al., 2013), there is a major design difference of QDT compared to OXT for supporting the disjunctive query semantics while collecting ranking features correctly and efficiently. OXT combines a blinded static document ID d with a term ID w as a lookup key in a pre-computed static hash table, to check if document d contains term w in an index. This is designed for conjunctive query handling and may leak the cross-query two-term co-occurrence information. In comparison, QDT uses Expression (4) as the part of a query-specific hash table key which is term-independent and contains a query-dependent random number, and this is helpful in reducing inter-query information leakage of document sharing among posting lists involved in different queries.

For time efficiency, there is a significant difference between OXT and QDT due to the above design. OXT’s list intersection scans the posting list of the first search term; for each document d in the first list, OXT uses key (d, w) to check in a hash table whether d contains another term w. OXT’s time complexity is dominated by expression \(\Theta (m\cdot L\cdot (T_{keycomp} +T_{lookup}))\), where m is the number of search terms, L is the average posting list length, \(T_{keycomp}\) is the time to compose a key, which involves exponentiation of long-bit integers, and \(T_{lookup}\) is the time to perform a hash table lookup. Parameter \(T_{keycomp}\) is as expensive as \(T_{gtag}\) in QDT, and thus there is a large cost difference between \(m\cdot L\cdot T_{keycomp}\) in OXT and \(m\cdot C\cdot T_{gtag}\) in QDT because \(L \gg C\) under the two-level design of QDT. The hash table tag lookup cost in QDT is less significant because its query-specific hash table is small, while the hash table involved in OXT is huge and typically does not fit in memory.

Partial server-side additive ranking in  Agun et al. (2018) uses OXT’s technique of inverse-based blinding, and thus can leak the cross-query two-term co-occurrence information. It follows OXT’s hash table structure and lookup method driven by the first search term, requiring a large server memory to host its hash table. In addition, it does not allow multi-stage search with complex neural re-ranking conducted at the server side, and the server has to send a large amount of un-ranked results to the client side with a significant communication overhead, which is not scalable for a large dataset with many results matching a query.

The HXT work in  Lai et al. (2018) improves the security of OXT for conjunctive queries using hidden vector encryption and a Bloom filter, with substantial overhead in multi-round client–server communication. Adopting HXT’s encryption design with client–server collaborative communication would increase communication overhead at least by a factor proportional to the number of search terms. Multi-round communication between the server and client used in HXT and the previous work (e.g.  (Naveed et al., 2014; Hu et al., 2011)), or extensive communication to rely more on clients (Agun et al., 2018; Xu et al., 2021), is not considered in this paper because they incur high communication cost and do not scale well for large datasets or a low-bandwidth platform. Like HXT, the work in  Cash et al. (2013) also discusses the use of a Bloom filter to improve efficiency. Our work does not adopt the idea of a Bloom filter because, with it, documents that do not satisfy the query semantics may be matched due to false positives and wrong feature aggregation for some documents is possible, which yields incorrect ranking results.

3.4 Guided padding with fake documents

We adopt the idea of Kamara and Moataz (2019), Patel et al. (2019) to pad fake documents, which hides the length of real posting lists. We add a flag to a fake document ID before its encryption; thus a server is unable to detect fake documents added to an index, and only a client can filter out fake document IDs mixed into the returned query results after ID decryption. A naive padding method based on the idea of the previous work is to randomly add fake documents to each posting list and select the group ID and member ID of these fake documents from a large integer value space. This naive method can still leak document sharing patterns in an index, because an adversary server can approximately identify a document using its encoded member ID value when such values are relatively unique among documents, being sampled from a large integer space (of size \(2^{64}\) in our implementation).

There are three additional considerations. (1) The ranking feature values of fake documents must not look dissimilar to those of real documents, so that the server cannot detect fake documents easily. (2) The existence of fake documents should not affect the effectiveness of top candidate selection in ranking. (3) We want to add fake documents to existing groups as much as possible, because online computation of group tags is expensive and the addition of new groups should be avoided.

With the above considerations, we need a padding strategy different from that of Kamara and Moataz (2019), Patel et al. (2019). Our goal is not only to hide the distribution of posting list lengths but also to make padded documents indistinguishable from real documents while limiting the addition of new group IDs. Our idea is to use the ID information and feature value distributions of real documents to guide fake document generation, and to obfuscate the hashed member IDs with k-anonymity (Sweeney, 2002; Di Castro et al., 2016): each real member ID value corresponds to multiple documents appearing in a posting list, accomplished by mixing it into \(k-1\) randomly generated fake records, so that an attacker cannot distinguish the real one. The example in Fig. 1 illustrates the inclusion of 4 fake documents where value \(mid_1\) appears 3 times, so document \(d_1\) cannot be identified well by using value \(mid_1\) alone.

Let U be the maximum padding factor. Given a posting list of term w with length r, we select a random number u in the interval \([1, U \cdot r]\); u is the total number of fake documents to be added. Let \(G'\) be the set of all possible group IDs available to choose for a fake document. Our guided padding algorithm adds u fake documents to this posting list by repeating the following steps (a code sketch follows the discussion below):

  1. Let G memorize all group IDs used so far. Initially, G contains all group IDs used in this given posting list. The algorithm gradually expands G towards \(G'\) to cover more valid group IDs when needed. This allows the distribution of fake member IDs to be about the same as that of real member IDs.

  2. Randomly sample \(mid'\) from \(D_{mid}\) and \(feat'\) from \(D_{feat}\). Here \(D_{mid}\) is the distribution of the member IDs of real documents in the entire index and \(D_{feat}\) is the distribution of the ranking features of the real documents in the entire index.

  3. Randomly sample \(gid'\) from G such that \((gid', mid')\) has not appeared in this posting list. Add \((gid', mid', feat')\) as a new fake document with feature \(feat'\) to this posting list.

Since the ranking features of fake documents follow the real feature distributions under our padding method, fake documents have an equal chance of being selected into the top results. With a maximum padding ratio of U, the expected ratio of fake to real documents in the top results is about \(\frac{U}{2}\); thus an earlier-stage ranking algorithm should enlarge its search scope proportionally and submit about \(K'(1 + \frac{U}{2})\) top results to the next-stage ranking to retain \(K'\) real candidates.
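A sketch of this padding procedure is given below; sampling from \(D_{mid}\) and \(D_{feat}\) is modeled by drawing from empirical arrays, and the expansion of G toward \(G'\) is omitted for brevity:

```cpp
#include <cstdint>
#include <random>
#include <set>
#include <utility>
#include <vector>

struct FakeRecord { uint64_t gid, mid; uint32_t feat; };

// Guided padding for one posting list of length r (assumes U * r >= 1).
// `used` is a copy of the (gid, mid) pairs already in this posting list.
std::vector<FakeRecord> GuidedPad(
    std::vector<uint64_t> G,                    // step 1: group IDs used so far
    const std::vector<uint64_t>& real_mids,     // empirical D_mid
    const std::vector<uint32_t>& real_feats,    // empirical D_feat
    std::set<std::pair<uint64_t, uint64_t>> used,
    size_t r, double U, std::mt19937_64& rng) {
  std::uniform_int_distribution<size_t> pick_u(1, static_cast<size_t>(U * r));
  const size_t u = pick_u(rng);                 // total fake documents to add
  std::vector<FakeRecord> fakes;
  while (fakes.size() < u) {
    uint64_t mid = real_mids[rng() % real_mids.size()];    // step 2: sample D_mid
    uint32_t feat = real_feats[rng() % real_feats.size()]; // step 2: sample D_feat
    uint64_t gid = G[rng() % G.size()];                    // step 3: sample from G
    if (!used.insert({gid, mid}).second) continue;         // (gid, mid) must be new
    fakes.push_back({gid, mid, feat});
    // Expanding G toward G' when collisions become frequent is omitted here.
  }
  return fakes;
}
```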

4 Evaluation

Datasets and setting We use the following TREC test collections for evaluation because they are widely adopted in ad-hoc keyword search studies (e.g.  (Guo et al., 2016; Xiong et al., 2017; Dai et al., 2018; Ji et al., 2018; Shao et al., 2019)): (1) Robust04 uses TREC Disks 4 & 5 (excluding Congressional Records), which contain about 0.5M news articles. (2) ClueWeb09-Cat-B uses ClueWeb09 Category B with 50M web pages (Jamie Callan’s research group, 2020); spam filtering is applied using the Waterloo spam score with threshold 60. For ClueWeb, we have further divided the dataset into 60 partitions for parallel query processing, and thus the reported time is the parallel time with 60 cores. For Robust04, the average number of documents per posting list is 9789 and the average number of buckets is 355. For ClueWeb, the average number of documents per posting list is 15,462 and the average number of buckets is 394.

The relevance features include BM25 term scores for single words and word pairs in the title and body sections. To control the number of word-pair terms, we limit word pairs to within distance 3 in indexing and query term generation. BM25 features are obfuscated into 50 non-uniform value partitions with extreme down-sampling (Hacigümüş et al., 2002; Ryoo et al., 2017) to hide privacy-sensitive details. We leveraged Indri (Strohman et al., 2005) to generate the posting lists and produce a two-level index. We have used the 256-bit elliptic curve optimization instead of 2048-bit long integers (Boneh & Shparlinski, 2001), as discussed in Sect. 3.3. The majority of indexing time comes from encryption of document IDs and bucket tag computations with 256-bit elliptic curves, which is still slow. For example, it takes about 960 CPU hours to generate the encrypted index for ClueWeb, about 64x slower than traditional indexing with Indri (Strohman et al., 2005) without encryption.

Our evaluation uses 250 queries with average query length 2.64 from TREC Robust 2004 and 2005, and 200 queries with average query length 2.48 from TREC Million Query 2009 to 2012 for ClueWeb. Experiments are conducted on Linux servers, each with an Intel i5-8259U 2.3GHz CPU, 32GB DDR4 memory, and NVMe solid-state drives (SSDs). We have implemented QDT in C++ and the code is compiled with flag -O3. We report the average query processing response time and relevance scores, including the normalized discounted cumulative gain (NDCG) (Järvelin & Kekäläinen, 2002) and the precision of the top results (Baeza-Yates & Ribeiro-Neto, 2011). NDCG@p, with a value between 0 and 1, measures the quality of a ranking result against the ideal ranking for the top p positions. All reported query time numbers x are within a 95% confidence interval of \(x \pm 0.01\), gathered through multiple runs.

4.1 Time efficiency and ranking relevance

Search relevance in the presence of fake documents Table 2 lists the relevance score of QDT in NDCG and precision (Järvelin & Kekäläinen, 2002) for Robust04 and ClueWeb after filtering out padded fake documents. It also lists the relevance scores of several up-to-date baselines for ranking with and without privacy constraints in searching the tested TREC datasets. Row 3 is for QDT with BM25-based ordering without using fake documents and feature obfuscation. Row 4 is for QDT with BM25 ranking with obfuscated features and fake documents. Row 5 is for re-ranking with a privacy-aware neural model called ConvKNRM/TOC (Shao et al., 2019; Dai et al., 2018) after QDT retrieval and obfuscated BM25 first-stage ranking. Since final results received by a client contain fake documents, the server needs to send more top results than needed, depending on the maximum padding ratio. Ranking selects top 1000 documents when no fake documents are padded, and top 1500 documents when the maximum padding ratio is 1. The result shows that our padding of fake documents has no visible impact on relevance score for Robust04. We have performed a pairwise t-test on the retrieval with and without fake documents, which shows that there is no statistically significant difference at the 95% confidence level. For ClueWeb, although there is a small relevance degradation during retrieval, almost all desired documents still appear in top retrieved results.

Table 2 compares QDT with two baselines for privacy-aware ranking. Row 7 is for partial server-side additive ranking with a linear combination (Agun et al., 2018) (we call it PAR). PAR underperforms QDT with neural ranking for two reasons: (1) it does not use semantic matching with neural embeddings; (2) it uses conjunctive query semantics, reducing the chance of finding semantically relevant documents that do not contain all query keywords. Row 8 is the relevance result of ConvKNRM/TOC published in  Shao et al. (2019), which assumes the top results and their ranking features are fetched safely with privacy protection; its result is comparable on average to Row 5. Thus QDT provides proper privacy-aware retrieval support for neural re-ranking with ConvKNRM/TOC.

As a reference, Table 2 lists the relevance scores of two recent BERT-based ranking methods without privacy constraints. Row 10 is for SPLADE v2 (Formal et al., 2022, 2021) with a learned sparse index and knowledge distillation; its Robust04 performance is collected from  Thakur et al. (2023). Row 11 is for BECR composite re-ranking (Yang et al., 2022), which uses transformer-based contextual embeddings for document and query representation. These latest methods without privacy constraints outperform QDT in Row 5, which uses static neural embeddings, while the gap is modest: 4% and 9.4% for Robust04, and 9.4% for ClueWeb. They represent complementary and orthogonal optimizations, and leveraging them in QDT is left to future work. Addressing privacy in transformer-based models is still an open problem in general, and their query processing is much more expensive due to their complexity. Thus our current work represents a tradeoff in advancing the state of the art on privacy-aware ranking.

Table 2 Client-side view of relevance for ranking with and without privacy constraints
Table 3 Query time in seconds with max padding ratio 1

Time cost breakdown in query processing with QDT Table 3 lists the average query processing time and its cost breakdown as the number of query words varies. Row 1 is the query length and the number of queries used for each length. For both datasets, the maximum padding ratio is 1 and on average 33% of documents are fake. The query response time breakdown is as follows: rows marked “Client” are the client-side preprocessing time, dominated mainly by token generation; rows marked “Comm.” are the client–server communication time for sending tokens and encoded terms and receiving the top ranked results, assuming a network with 10 ms latency and 5 Mbits/s bandwidth; rows marked “Tag comp.” are the server-side time for group tag computation; rows marked “Hash table” are the server-side time to combine features by document tags through a hash table; rows marked “Decrypt” are the client-side post-processing time to decrypt ranked document IDs and filter out fake documents.

Table 3 shows that the client-side cost of computing deblinding tokens is not significant with the use of modulo positioning. The group tag computation takes a significant portion of the server-side time, and this cost is proportional to the number of buckets; thus two-level indexing is effective in controlling this portion of the cost. As the query length increases, the number of deblinding tokens and involved posting lists increases, and thus there is an increasing amount of client-side token computation and server-side group tag computation. The cost of hash table operations for feature aggregation is relatively small. The time for posting list access from the SSD-based key-value store is small, as the number of store lookups is equal to the number of search terms.

Impact of two-level indexing on retrieval time Table 4 shows the query time in three settings with maximum padding ratio 1: (1) a one-level index with no group buckets (Column 3); (2) a two-level index with 1024 groups but without modulo positioning (Column 4); (3) a two-level index with 1024 groups and modulo positioning with \(P=128\) (last column). Only the cost of client-side token computation, client–server communication, and server-side tag computation is listed, as grouping only affects these three items significantly. Without two-level indexing, there is much more cost in client-side token computation and server-side tag computation. For ClueWeb, two-level indexing without modulo positioning reduces the average number of deblinding tokens needed per term from about 15,462 to 1,024 in this case. Modulo positioning further reduces the number of tokens to 128, though the number of bucket group tags to compute does not change. The communication cost is also reduced proportionally with fewer deblinding tokens. The overall speedup from the ungrouped setting is up to 130x with the two-level design, and modulo positioning brings an additional 1.8x speedup. The speedup for Robust04 is 131x with the two-level design.

Table 4 Query time with or without two-level indexing

Impact of padding on query time and storage cost Table 5 lists the query time and storage space cost without padding, and with maximum padding ratios of 0, 1, 2, and 3. For ClueWeb, the storage cost listed is the total gigabytes for all partitions. The impact of adding fake documents to buckets on the query time falls mainly on hash table operations, since padding mainly enlarges the number of posting records per list randomly. The storage space increases proportionally as the maximum padding ratio increases, since the posting lists consume the majority of the space.

Table 5 Impact of padding on query time and space cost

Storage cost 32 bytes are used for the ID of a posting bucket. Each posting record for a unigram takes 38 bytes, which includes a 32-byte encryption of the document ID (an initialization vector and one AES block, 16 bytes each), a 2-byte hashed member ID, and 4 bytes for basic features. For a word pair, the document ID field is not needed, as explained in Sect. 3.1. Since the document ID encryptions and bucket tags are uniformly distributed long-bit random integers, following standard cryptographic requirements, the index of QDT cannot be significantly compressed. In comparison, the traditional index produced by Indri (Strohman et al., 2005) takes about 107GB for ClueWeb, and hence the QDT index with no padding (\(U=0\)) has about 4.8x more space cost. This is mainly caused by the non-compressibility of long-bit random integers in our index and the use of word-pair-based terms.

Table 6 Retrieval time in seconds with and without privacy constraints

Comparisons of retrieval time with and without privacy constraints Table 6 lists a baseline comparison of document retrieval time with and without privacy constraints. Indri (Strohman et al., 2005) is an open-source search engine. VBMW/p and DBMW/p (Shao et al., 2021) are recent variants of BMW (Ding & Suel, 2011; Mallia et al., 2017), which optimizes index navigation by retrieving documents in blocks to skip low-score documents. These retrieval methods do not consider privacy. In comparison, QDT’s response time is 0.50 s and 0.63 s respectively, where hash-table based feature gathering during list traversal costs around 0.11 s and 0.19 s, as shown earlier in Table 3. The higher latency of QDT compared to VBMW/p and DBMW/p represents a tradeoff for privacy. As discussed in Sect. 3.2, we do not use the BMW-based optimization in Shao et al. (2021), Ding and Suel (2011), Mallia et al. (2017) because it requires each posting list to be presorted by static and deterministic document IDs, which is not possible in our privacy-aware setting where the same document has different encrypted ID values in different posting lists.

Excluding tag computation cost, the list traversal of QDT is dominated by the conversion from a 512-bit document tag to a 32-bit hash-table key (\(\sim\) 70%), and without index skipping it is slower than VBMW/p and DBMW/p while gaining privacy. Excluding these two factors, our list traversal time is on a par with the existing work. This comparison reconfirms that the price paid for supporting privacy is significant because of expensive cryptographic computation on very long integers, and it underscores why the optimizations studied in this paper are necessary to bring down the cost. With the proposed optimizations, our scheme delivers a sub-second response time; otherwise the cost would be about 130x higher, as shown in Table 4.
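For intuition on the dominant traversal step, here is a sketch of folding a 512-bit query-specific document tag into a 32-bit hash-table key; the particular fold (truncating a keyed hash) is our assumption, and any collision-tolerant reduction would fit.

```python
import hashlib

def tag_to_key(doc_tag: bytes) -> int:
    # Fold a 512-bit query-specific document tag into a 32-bit key so that
    # feature aggregation for the same document can use ordinary hash tables.
    assert len(doc_tag) == 64  # 512 bits
    digest = hashlib.blake2b(doc_tag, digest_size=4).digest()
    return int.from_bytes(digest, "big")  # 32-bit hash-table key
```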

Table 6 also lists the retrieval time of PAR ranking (Agun et al., 2018) under the same setting with privacy-aware partial server ranking. PAR is an extension and improvement of OXT (Cash et al., 2013) for ranking with conjunctive queries. QDT is faster than PAR, even excluding the client-side communication cost incurred by PAR. Notice that PAR has lower relevance, as discussed in Table 2, partially because of its conjunctive constraint. Table 6 also lists the extension of PAR (called PAR/D) that follows disjunctive query semantics. The results show that QDT is 2.20x faster than PAR/D for Robust04 and 6.83x faster for ClueWeb. PAR/D converts a disjunctive query into several conjunctive subqueries with different s-terms. For each subquery, PAR/D computes the union of matching documents and aggregates ranking features for the same document IDs using OXT-based hashtable lookups. Since PAR with partial server-side ranking requires a client to conduct the final ranking, a large amount of data is sent from the server to the client. The test setup in this evaluation has 100 megabits per second of communication bandwidth between the server and a client, which is reasonable on average for a home cable or 5G cellular Internet connection in the USA. The client–server communication in PAR/D takes 0.03s for Robust04 and increases to 2.7s for ClueWeb; this large increase arises because the ClueWeb dataset is 100x larger than Robust04. Even excluding this client–server communication overhead, QDT is still 60% faster for Robust04 and 2.54x faster for ClueWeb compared to PAR/D. In general, QDT provides more flexibility to reach a higher relevance score at a lower cost, because PAR and PAR/D do not conduct full server ranking and pay a significant communication overhead to let a client collect a large amount of data for further ranking, in addition to the hashtable cost issue explained at the end of Sect. 3.3.
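A rough sketch of our reading of the PAR/D flow above (not its actual code): each query term takes a turn as the s-term of a conjunctive subquery, and per-document features from all subqueries are merged in a hash table keyed by document ID. `run_conjunctive_subquery` is a hypothetical stand-in for an OXT-style subquery returning matching documents with their features.

```python
from collections import defaultdict

def par_d(query_terms, run_conjunctive_subquery):
    # `run_conjunctive_subquery(s_term, others)` is assumed to return a
    # {doc_id: features} map for documents matching that subquery.
    merged = defaultdict(list)
    for i, s_term in enumerate(query_terms):
        others = query_terms[:i] + query_terms[i + 1:]
        for doc_id, feats in run_conjunctive_subquery(s_term, others).items():
            merged[doc_id].append(feats)  # union + per-document aggregation
    return merged
```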

4.2 Guided padding for attack prevention

We demonstrate a plaintext-recovery attack that exploits the co-occurrence leakage (Islam et al., 2012) of two or three words to recover the plaintext of encoded word IDs appearing in a hosted index, and we examine how the guided padding of QDT discussed in Sect. 3.4 thwarts such an attack. This 2-word attack makes the following assumptions:

  1. A server hosts the index for dataset D (Robust04 in this case) and knows the plaintext of a set of English words included in D, called E. We assume 10% of the documents in D are public or injected by the server adversary by some means, and the server can use them to approximate the co-occurrence probability of two words in E among the documents of D. The corresponding co-occurrence matrix is called M. Recall that the co-occurrence probability of two words \(w_1\) and \(w_2\) is the number of documents in D that contain both \(w_1\) and \(w_2\), divided by \(|D|\) (a sketch of this computation follows the list). The server does not know the IDs of the English words in E and wants to recover plaintext-to-ID mappings for some of these words.

  2. The server estimates the co-occurrence probability of two word IDs using one of the two methods below; the corresponding probability matrix is called \(M'\):

    • A) Observe the processing of some 2-word queries without knowing their plaintext; these query word IDs form set \(E'\). We assume \(E' \subseteq E\). By observing the execution of QDT, the server can guess that two matched documents for a query in different posting lists are the same if their query-specific document tags are equal, even though such a document could be fake. Let Y be the fraction of all possible word pairs formed from \(E'\) whose co-occurrence can be estimated using Method A. Initially \(Y=1\), namely all pairs of words in \(E'\) appear in the observed queries. We will vary the value of Y to assess whether our findings hold when not all word pairs in \(E'\) appear in these queries.

    • B) Inspect the inverted index, and guess that two documents appearing in the posting lists are the same if their group member IDs are equal.

  3. The server has already obtained a mapping from word IDs to plaintext for a small number (X) of English words in \(E'\). The goal of the server attack is to recover the plaintext mapping for the remaining word IDs in subset \(E'\). In our evaluation, \(|E'|=150\), and we vary parameter X from 20 to 0 and parameter Y from 1 to 0.1.
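As referenced in assumption 1, this is a sketch of how the adversary could build the known-word co-occurrence matrix M from the roughly 10% of documents it can read; `docs` is a list of token sets, and dividing sample counts by the sample size approximates count/\(|D|\) over the full dataset.

```python
from itertools import combinations

def cooccurrence_matrix(docs: list[set[str]], words: list[str]) -> list[list[float]]:
    # Count, over the visible documents, how often each pair of known
    # plaintext words co-occurs, then normalize by the sample size to
    # approximate the co-occurrence probability over the full dataset D.
    idx = {w: i for i, w in enumerate(words)}
    M = [[0.0] * len(words) for _ in range(len(words))]
    for doc in docs:
        present = [w for w in words if w in doc]
        for w1, w2 in combinations(present, 2):
            M[idx[w1]][idx[w2]] += 1
            M[idx[w2]][idx[w1]] += 1
    n = len(docs)
    for row in M:
        for j in range(len(row)):
            row[j] /= n  # sample frequency approximates count / |D|
    return M
```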

Attack method The server approximates the co-occurrence probability matrix \(M'\) for set \(E'\) using Method A or B above. Next, the attack algorithm follows the simulated annealing technique (Kirkpatrick et al., 1983) used in Islam et al. (2012) to find a mapping from the word IDs in matrix \(M'\) to the plaintext words of a sub-matrix of M with a minimized co-occurrence matching error.
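The following sketch conveys the annealing attack in spirit (after Islam et al., 2012); the cooling schedule, squared-error metric, and data layout (M keyed by plaintext words, \(M'\) keyed by word IDs, both as nested dicts) are illustrative assumptions rather than the exact published procedure.

```python
import math
import random

def anneal(M, M_prime, candidates, known, steps=100_000, t0=1.0, cooling=0.9999):
    # M: {word: {word: prob}} from the adversary's visible documents.
    # M_prime: {word_id: {word_id: prob}} estimated via Method A or B.
    # known: the X already-recovered word_id -> word mappings (kept pinned).
    ids = list(M_prime)
    mapping = dict(known)
    free = [i for i in ids if i not in mapping]
    for i, w in zip(free, random.sample(candidates, len(free))):
        mapping[i] = w  # random initial assignment for unpinned IDs

    def error(m):
        # Co-occurrence matching error between M' and the mapped sub-matrix of M.
        return sum((M_prime[a][b] - M[m[a]][m[b]]) ** 2
                   for a in ids for b in ids if a != b)

    cur, t = error(mapping), t0
    for _ in range(steps):
        a, b = random.sample(free, 2)  # propose swapping two unpinned assignments
        mapping[a], mapping[b] = mapping[b], mapping[a]
        new = error(mapping)
        if new < cur or random.random() < math.exp((cur - new) / t):
            cur = new  # accept improving (or occasionally worsening) moves
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # revert
        t *= cooling
    return mapping
```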

Table 7 shows the number of recovered English words in set \(E'\). Columns 2 to 4, marked “Eq. member ID guess”, use Method B, which guesses term co-occurrence probability based on equal member ID values; X varies from 20 to 0 in these three cases. The remaining columns use Method A, which collects which documents have equal query-specific document tags during query processing, under various values of X and Y. The attack computation for each configuration setting takes a few hours. Note that the server does not know whether a recovered ID-to-plaintext word mapping is correct, and thus an entry value of 14 means that the server's plaintext recovery accuracy is 14 out of 130 unknown words, given \(|E'|-X=130\).

The first column from Row 2 in Table 7 lists the different padding methods discussed in Sect. 3.4. We explain the second column with \(X=20\) as follows. With no padding of fake documents (Row 2), the recovery accuracy of this attack is 14 out of 130 given 20 known mappings, even though matrix M is only approximated. With the naive padding method, the selection of random IDs for fake documents is unbiased and these IDs are unlikely to collide; therefore this strategy does not obfuscate the co-occurrence information, and the server can still recover 13 English words in \(E'\). By applying guided padding with padding ratio \(U=1\), the recovery accuracy decreases to 2 out of 130. As U increases further, \(M'\) diverges from M with much more noise, and as a result no words are recovered. Table 7 also varies the X and Y values. As X becomes smaller, fewer known ID-to-word mappings are available to guide the attack. For each setting, guided padding decreases the server's recovery accuracy significantly, and the accuracy becomes zero when \(U \ge 1\) or \(U \ge 2\). As Y becomes smaller, less information is available to guide simulated annealing in finding a good word matching, and fewer words are recovered.

Table 7 Recover 150 words under various padding strategies

Our attack using 3-word co-occurrence leakage is similar to the above, with a 3D co-occurrence matrix. Following a setting similar to Table 7, we find that a server adversary can recover up to 3 words by exploiting 3-word co-occurrence with naive padding or without padding, and our guided padding strategy effectively decreases the number of recovered words to zero with \(U \ge 1\).

We have also evaluated QDT under a count attack (Cash et al., 2015), which further exploits posting list lengths under a slightly different co-occurrence assumption. The results show that guided padding remains effective and decreases the number of recovered words to zero with \(U \ge 1\) or \(U \ge 2\).

5 Concluding remarks

The contribution of this paper is a privacy-aware document retrieval scheme with query-specific tagging and two-level indexing that minimizes the leakage of document sharing patterns when matching documents and gathering their features for ranking. Our evaluation shows that two-level indexing in QDT can control the query processing cost, with up to 131x speedups in the tested cases compared to a single-level design, and that guided padding can effectively thwart leakage-abuse plaintext attacks exploiting word co-occurrence information. Compared to a traditional retrieval scheme with no privacy constraints, QDT pays a significant space and time cost for enhanced privacy protection due to cryptographic operations on long-bit numbers, which further demonstrates the importance of the efficiency optimizations proposed in QDT.