Pattern Matching on Encrypted Streams

Desmoulins, Nicolas; Fouque, Pierre-Alain; Onete, Cristina; Sanders, Olivier

doi:10.1007/978-3-030-03326-2_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11272)

Included in the following conference series:

International Conference on the Theory and Application of Cryptology and Information Security

Abstract

Pattern matching is essential in applications such as deep-packet inspection (DPI), searching on genomic data, or analyzing medical data. A simple task to do on plaintext data, pattern matching is much harder to do when the privacy of the data must be preserved. Existing solutions involve searchable encryption mechanisms with at least one of these three drawbacks: requiring an exhaustive (and static) list of keywords to be prepared before the data is encrypted (as in symmetric searchable encryption); requiring tokenization, i.e., breaking up the data to search into substrings and encrypting them separately (e.g., BlindBox); relying on symmetric-key cryptography, thus implying a token-regeneration step for each encrypted-data source (e.g., user). Such approaches are ill-suited for pattern-matching with evolving patterns (e.g., updating virus signatures), variable searchword lengths, or when a single entity must filter ciphertexts from multiple parties.

In this work, we introduce Searchable Encryption with Shiftable Trapdoors (SEST): a new primitive that allows for pattern matching with universal tokens (usable by all entities), in which keywords of arbitrary lengths can be matched to arbitrary ciphertexts. Our solution uses public-key encryption and bilinear pairings.

In addition, very minor modifications to our solution enable it to take into account regular expressions, such as fully- or partly-unknown characters in a keyword (wildcards and interval/subset searches). Our trapdoor size is at most linear in the keyword length (and independent of the plaintext size), and we prove that the leakage to the searcher is only the trivial one: since the searcher learns whether the pattern occurs and where, it can distinguish based on different search results of a single trapdoor on two different plaintexts.

To better show the usability of our scheme, we implemented it to run DPI on all the SNORT rules. We show that even for very large plaintexts, our encryption algorithm scales well. The pattern-matching algorithm is slower, but extremely parallelizable, and it can thus be run even on very large data. Although our proofs use a (marginally) interactive assumption, we argue that this is a relatively small price to pay for the flexibility and privacy that we are able to attain.

1 Introduction

Learning whether a given pattern occurs in a larger input string (and where exactly that happens) has many applications, such as when searching on genomic data, in deep-packet inspection (DPI), or when delegating searches in databases. In such cases, the entity performing the search, usually called the gateway, is only semi-trusted by the owner of the input data. Indeed, in all three scenarios above, it is of paramount importance to preserve the privacy of the input data^{Footnote 1}.

Consider the case of a middlebox, such as a virus scan or a firewall. A user who may trust the middlebox to scan its data for viruses might not, in fact, be comfortable revealing the full contents of its data to that middlebox. Similarly, a person might trust a laboratory to check whether their genome contains a particular substring (indicating, e.g., a genetic predisposition to a disease); however, the laboratory should not, in this way, come into possession of that person’s full genome. Such concerns have been exacerbated lately by threats of mass-surveillance, following the revelations of Edward Snowden. As a consequence, data encryption is slowly becoming an a priori pre-requisite for pattern matching.

In cryptography, pattern matching on encrypted data is closely related to Searchable Encryption, either Symmetric [16,17,18, 32] or Public-Key [9]. Many Searchable Encryption solutions, however, only allow searching for pre-chosen keywords, which are hard-coded in the encrypted input. Searching for a new keyword, not indicated a priori, in that same (already encrypted) data would yield a false negative, even if that keyword is in fact contained in the input data. Correctly matching the new pattern to the data requires that the latter be re-encrypted. This approach is therefore ill-suited to more dynamic environments, like DPI. We provide a full comparison with related literature, including searchable encryption, in Sect. 1.2.

Pattern matching with non-static patterns can be achieved through symmetric-key techniques and so-called tokenization [31]. In this approach, a sliding-window technique is used to encode keywords of a given, fixed length, which can then be matched by the searcher. This allows searches to be performed for arbitrarily-chosen keywords; however, a disadvantage is that each instantiation requires a new generation of tokens. Moreover, this only works for a fixed keyword length, and different ciphertexts are required to handle different pattern sizes. This is less than ideal for many use-cases such as DPI, since for instance SNORT rules [1] include patterns of many different lengths. In this paper, our goal is to improve on this solution, specifically by allowing searches on encrypted data with patterns that are non-static (flexible), of variable length, and universal (no need to re-tokenize). In particular, we achieve secure pattern-matching on encrypted data with universal tokens.

1.1 Our Contributions

We opt for a solution in a public-key setting (which immediately achieves universality for our patterns). The gateway will be able to search for keywords on encrypted data using trapdoors that are unforgeable. More specifically, our construction can support pattern matching for keywords that can be adaptively chosen and which can have variable lengths. Moreover, the size of the trapdoors corresponding to those keywords does not depend on the length of the input data (our trapdoors are short, even when we are searching in very large input data). We support regular expressions, such as the presence of wildcards or matching encrypted input to general data-subsets. Thus, our solution is well suited to deep packet inspection or delegated searches on medical data.

Intuitively, in our construction we project each coordinate of the plaintext S (and then of the keyword W) on a geometric basis consisting of some values $z^i$, for $i = 0, \dots , |S|-1$. We prevent malleability of trapdoors by embedding the exact order of the bits of W into a polynomial, which cannot be forged without the secret key. A fundamental part of the searching algorithm that we propose is the way in which the middlebox will be able to shift from one part of the ciphertext to another, when searching for a match with W. Thus, our scheme can be viewed as an anonymous predicate encryption scheme where one could derive the secret keys for $(*,w_1, \ldots , w_\ell , *, \ldots , *)$, $\ldots $, $(*, \ldots , *, w_1,\ldots , w_\ell )$ from the secret key for $(w_1, \ldots , w_\ell , *, \ldots , *)$.

Such changes require the definition of a new primitive that we call Searchable Encryption with Shiftable Trapdoors (SEST). We provide a formal security model for the latter, which ensures that even a malicious gateway knowing trapdoors $\mathsf {td}_{W_1},\ldots ,\mathsf {td}_{W_q}$ does not learn any information from an encrypted string S beyond the presence of the keyword $W_k$ in S, for $k\in [1,q]$.

Our construction is – to our knowledge – the first SEST scheme, and thus can be taken as a proof-of-concept construction. We guarantee the desired properties by only using asymmetric prime order bilinear groups (i.e. a set of 3 groups $\mathbb {G}_1,\mathbb {G}_2$ and $\mathbb {G}_T$ along with an efficient bilinear map $e:\mathbb {G}_1\times \mathbb {G}_2 \rightarrow \mathbb {G}_T$) for which very efficient implementations have been proposed (e.g. [7]). Encryption of plaintexts S only requires operations in the group $\mathbb {G}_1$, while detection of the keyword W is done by performing pairings. The former operation requires only the public key while the latter additionally needs the corresponding trapdoor; only the trapdoor-issuing algorithm requires the corresponding secret key.

We are able to allow for pattern-matching when some of the contents of the keywords are either fully-unknown, i.e., wildcards, or partially-unknown, i.e., in an interval. Searches for such regular expressions remain fully-compatible with our original solution. In the first case, the only difference is that when issuing the trapdoor, instead of fully randomizing it we choose special randomness – equal to 0 – for the “coefficients” of the polynomial that we project the wildcards or unknown subsets to. For the scenario of partially-known trapdoors, we require a more complex key-generation process since we use different values on which to (uniformly) project the unclear values to. These will be used in the trapdoor generation step, ensuring that if a partially-known input is used, that coefficient of the trapdoor will still “vanish”.

In particular, our pattern-matching algorithm is very similar to that of Rabin-Karp and consequently, we can use it to solve similar problems. In addition to the previous use-cases, our technique can also be used to perform 2D pattern matching in images, or searching subtrees in rooted, labelled trees. However, note that due to the privacy-preserving goal of our work, we cannot benefit from many of the tricks used by Rabin-Karp, thus yielding a scheme with limited efficiency.
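For reference, the plaintext Rabin-Karp algorithm mentioned above can be sketched as follows. This is the classical, unencrypted algorithm and not the encrypted search of this paper; the function name, the base and the modulus are illustrative choices. The rolling update of the window hash is one of the shortcuts that is natural on plaintext data.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> list:
    """Return every offset at which `pattern` occurs in `text` (plaintext only)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character of a window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for j in range(n - m + 1):
        # Equal hashes are only a hint; compare the strings to rule out collisions.
        if w_hash == p_hash and text[j:j + m] == pattern:
            hits.append(j)
        if j < n - m:
            # Roll the window: drop text[j], append text[j + m].
            w_hash = ((w_hash - ord(text[j]) * high) * base + ord(text[j + m])) % mod
    return hits
```

Like the search described in this paper, the loop visits each of the possible offsets of the pattern in turn.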

We also analyze how well our scheme performs when applied to DPI. We implemented our scheme to search for all the SNORT rules in input data of varying sizes. Even for large data, the encryption algorithm is very efficient. Moreover, while the testing (pattern matching) step scales less well with increasing input-data size, that particular step is highly parallelizable, and thus the running time can be much reduced.

Impact and Limitations. Our scheme allows for a flexible searchable encryption mechanism, in which encrypters do not have to embed a list of possible keywords into their ciphertexts. Moreover, we also provide a great deal of flexibility with respect to searching for keywords of arbitrary lengths. In this sense, our technique allows for searchable encryption with universal tokens, which can be used in deep-packet inspection, applications on genomic and medical data, or matching subtrees in labelled trees.

One limitation of our scheme is the size of our public keys. We require a public key of size linear in the size of the plaintext to be encrypted (which is potentially very large). This is mostly due to the need to shift the ciphertext each time in order to detect the presence of the keyword. We also require a large ciphertext, consisting of a number of elements that is again linear in the size of the plaintext; however, the same inefficiency is also inherent to solutions such as BlindBox [31], in which we must encrypt many “windows” of the data, all of the same size. Finally, the search of a keyword of size $\ell $ in a plaintext of size n requires at least $2(n-\ell +1)$ pairing computations.

Furthermore, we are only able to prove the security of our construction under an interactive assumption, unless we severely restrict the size n of the message space. Indeed, we need an assumption which offers enough flexibility to provide shiftable trapdoors for all possible keywords except those that allow trivial distinction of the encrypted string. We modify the $\mathsf {GDH}$ assumption [8] in a minimal way, to allow the adversary to request the values on which the reduction will break this assumption. We could remove the need for this flexibility, for instance by reducing the value of n so that the simulator could guess the strings targeted by the adversary, but this strongly limits the applications of our construction.

We argue that despite this interactive assumption, the intrinsic value of our construction lies in its flexibility, namely in the fact that we are able to search for arbitrary keywords. This significantly improves existing solutions of, e.g., detecting viruses on encrypted traffic over HTTPS [24, 25, 31].

Moreover, we emphasize that we achieve this high level of flexibility without using complex (and costly) cryptographic tools such as fully homomorphic encryption. We simply need pairings which have become quite standard in cryptography and which can be implemented very efficiently [7]. We therefore argue that our scheme, when compared to solutions providing the same features (see Sect. 1.3 for more details), offers a practical improvement over the state of the art.

1.2 Related Work

How Searchable Encryption Works. In searchable encryption (SE) [9, 16,17,18, 32], any party that is given a trapdoor $\mathsf {td}_W$ associated with a keyword W is able to search for that keyword within a given ciphertext. The ideal privacy guarantee required is that searching reveals nothing else about the underlying plaintext (other than the presence or absence of the keyword). Routing encrypted emails, querying encrypted databases or running an antivirus on encrypted traffic are typical applications which require such a functionality.

In general, SE searches are usually performed by the middlebox on keywords that have been pre-chosen by the party encrypting the ciphertexts (i.e., the encrypter). In particular, an encrypted string containing W can be detected by the middlebox knowing $\mathsf {td}_W$ only if the sender has selected W as a keyword and has encrypted it using the SE scheme. Such approaches are still suitable for some types of database searches (in which documents are already indexed by keywords), or in the case of emailing applications – for which natural keywords can be the sender’s identity, the subject line, or flags such as “urgent”. Unfortunately, in cases such as messaging applications, or just for common Internet browsing, the keywords are much harder to find, and can include expressions that are not sequences of words per se, but rather something of the kind “http://www.example.com/index.php?username=1”.

Our solution allows for better flexibility in terms of searching for arbitrarily-chosen keywords, even after the plaintext has been encrypted and sent. In fact, it is not even necessary that the encrypter be the same person as the party which issues the trapdoors. This makes our solution much better suited to DPI scenarios, whereas SE is typically better suited to database searches.

Tokenization. The solution proposed in [31] to search keywords of length $\ell $ is to split the string $S=s_0\ldots s_{n-1}$ into $[s_0\ldots s_{\ell -1}]$, $[s_1\ldots s_{\ell }]$, $\ldots $, $[s_{n-\ell }\ldots s_{n-1}]$ and then to encrypt each of these substrings using a searchable encryption scheme (the substrings are thus the keywords associated with S). However, this solution has a drawback: it works well if all the searchable keywords $W_1,\ldots ,W_q$ have the same length, but this is usually not the case. In the worst case, if all searchable keywords $W_k$ are of different lengths $\ell _k$, the sender will have, for each $k\in [1,q]$, to split S into substrings of size $\ell _k$ and encrypt them, which quickly becomes cumbersome. One solution could be to split the searchable keywords $W_k$ into smaller keywords of the same length $\ell _{min} = \min _k (\ell _k)$. For example, if $\ell _{min}=3$ the searchable keyword “execute” could be split into “exe”, “cut” and “ute” for which specific trapdoors would be issued. Unfortunately, this severely harms privacy since these smaller keywords will match many more strings S. Moreover, repeating this procedure for every keyword $W_k$ will allow the gateway to receive trapdoors for a large fraction of the set of strings of length $\ell _{min}$ and so to recover large parts of S with significant probability.
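The sliding-window split described above can be sketched on plaintext strings as follows. The helper names are illustrative, and the per-window encryption step of [31] is deliberately left out; the sketch only shows how the number of windows grows when several keyword lengths must be supported.

```python
def windows(s: str, length: int) -> list:
    """All substrings of `s` with the given length, one per offset."""
    return [s[i:i + length] for i in range(len(s) - length + 1)]


def windows_for_lengths(s: str, lengths) -> dict:
    """One list of windows per distinct keyword length that must be searchable."""
    return {l: windows(s, l) for l in sorted(set(lengths))}


def total_windows(s: str, lengths) -> int:
    """Number of substrings the sender would have to encrypt in total."""
    return sum(len(w) for w in windows_for_lengths(s, lengths).values())
```

For a string of length n and a keyword length $\ell$, `windows` returns $n-\ell+1$ substrings, so supporting several distinct lengths multiplies the work of the sender accordingly.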

We note that Canard et al. [14] recently proposed a public key variant of the BlindBox [31] approach which therefore suffers from the same limitations. Moreover, their performance corresponds to the “delimiter-based” version of their protocol that consists in splitting a string $s=s_0\ldots s_{n-1}$ into t substrings $[s_0\ldots s_{n_1 -1}]$, $[s_{n_1}\ldots s_{n_2-1}]$, ..., $[s_{n_{t-1}}\ldots s_{n-1}]$ which are then independently encrypted using searchable encryption. While this dramatically reduces complexity, we stress that this only allows detecting patterns that perfectly match one of the substrings. In particular, a pattern cannot be detected if it straddles two substrings.
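The limitation of the delimiter-based split can be seen on a small plaintext example. The sketch below assumes the cut positions $n_1, \ldots , n_{t-1}$ are given as a sorted list; the function names are illustrative and no encryption is performed.

```python
def split_at(s: str, cuts) -> list:
    """Split `s` into consecutive chunks at the given sorted cut positions."""
    bounds = [0, *cuts, len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


def delimiter_match(s: str, cuts, pattern: str) -> bool:
    """A pattern is detected only if it equals one whole chunk."""
    return pattern in split_at(s, cuts)


chunks = split_at("abcdef", [2, 4])                 # ["ab", "cd", "ef"]
found_whole_chunk = delimiter_match("abcdef", [2, 4], "cd")
found_straddling = delimiter_match("abcdef", [2, 4], "bc")
```

Here "bc" is a substring of "abcdef" but straddles the first two chunks, so it goes undetected.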

By contrast, our scheme addresses the main drawback of this tokenization technique: we allow for universal trapdoors of arbitrary length to be matched against the encrypted data, without false negatives or positives. This comes at a cost in performance; however, we show in our implementation that our scheme remains practical.

Generic Evaluation of Functions on Ciphertexts. Evaluation of functions over encrypted data is a major topic in cryptography, which has seen very important results over the past decade. Generic solutions (e.g., fully homomorphic encryption [22], functional encryption [3, 4], etc.), supporting a wide class of functions, have been proposed; however, their very high complexity makes such solutions impractical. In practice, it is then better to use a scheme specifically designed for the function(s) that one wants to evaluate.

Several recent publications study secure substring search and text processing [5, 21, 23, 26, 28, 29, 33], specifically in two-party settings. Some of these papers provide applications to genomic data, specifically matching substrings of DNA to encrypted genomes. This was done by using secure multi-party computation or fully-homomorphic encryption (FHE). However, the former solution requires interaction between the searcher and the encrypter, whereas the use of FHE induces a relatively high complexity. Of particular interest here is the approach by Lauter et al. [28], which presents an application to genomic data. The authors go much further than just matching patterns with some regular expressions; however, they require FHE for their applications. We leave it as future work to investigate how far we can modify our technique with universal tokens in order to provide some support to the algorithms presented by Lauter et al. for genomic matching.

At first sight, anonymous predicate encryption (e.g. [27]) or hidden vector encryption [11] provide an elegant solution to the problem of searching on encrypted streams. Indeed, the sender could use one of these schemes to produce a ciphertext for some attributes $s_0,\ldots ,s_{n-1}$ which together make up a word S, while the middlebox, knowing the suitable secret keys, could detect whether S contains a substring W. The encryption process would then not depend on the searchable keywords and the anonymity property of these schemes would ensure that the ciphertext does not leak more information on S.

However, another issue arises with this solution. Indeed, $W=w_1\ldots w_\ell $ can be contained at any position in S. Therefore, the gateway should receive the secret keys for $(w_1,\ldots ,w_\ell ,*,\ldots ,*)$, $(*,w_1, \ldots , w_\ell , *, \ldots , *)$, $\ldots $, $(*, \ldots , *, w_1, \ldots , w_\ell )$, where “$*$” plays the role of a wildcard, to take into account all the possible offsets. So, for each searchable keyword of size $\ell $, the gateway would have to store $n-\ell +1$ keys, which is obviously a problem for large strings S.
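The list of wildcard-padded attribute vectors for which the gateway would need a key can be enumerated as below. This is only a counting illustration on plain strings; the function name is hypothetical and the wildcard symbol is assumed not to occur in the keyword.

```python
def offset_patterns(w: str, n: int, wildcard: str = "*") -> list:
    """One padded pattern per possible offset of `w` in a string of length n."""
    l = len(w)
    return [wildcard * j + w + wildcard * (n - l - j) for j in range(n - l + 1)]
```

For a keyword of size $\ell$ and strings of size n the list has $n-\ell+1$ entries, which matches the number of keys stated above.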

DPI with Multi-context Key-Distribution. Naylor et al. [30] recently presented a multi-context key-exchange over the TLS protocol, which aims to allow middleboxes (read, write, or no) access to specific ciphertext fragments that they are entitled to see. This type of solution has some important merits, such as the fact that it is relatively easy to put into practice and allows the middlebox to perform its task with a very low overhead (the cost of a simple decryption). In addition, the parties sending and receiving messages need not deviate from the protocols they employ (such as TLS/SSL).

However, such solutions also have important disadvantages. The first of these is that the privacy they offer is not ideal. Instead of simply learning whether a specific content is contained within a given message or not, the middlebox learns entire chunks of messages. Moreover, the access-control scheme associated to the key-exchange scheme is relatively inflexible. The middlebox is given read or write access to a number of message fragments, and this is not easily modifiable (except by running the key-distribution algorithm once more). Finally, despite the efficiency of the search step (once the key-repartition is done), the finer-grained the access control is – thus offering more privacy – the more keys will have to be generated and stored by the various participating entities.

1.3 Benefits of SEST

Pattern matching on encrypted data is a very frequently-encountered problem, which can be addressed by many different primitives. In this context, the benefits of our new primitive (SEST) might not seem obvious. To better understand the intrinsic differences between all these approaches, we provide in Fig. 1 a comparison of their asymptotic complexities. We choose to only consider the most relevant alternatives, namely Searchable Encryption (both Symmetric and Public-Key) and Predicate Encryption/Hidden Vector Encryption. Other solutions do exist, as explained above; however, they induce high complexity, interactivity or weaker privacy.

As we explained, searching substrings at any position using SSE or ASE requires a tokenization process which must be repeated for each possible length of keyword, hence the $O(n\cdot L)$ size of the ciphertext. ASE performance is an adaptation of the tokenization idea of BlindBox to the Public Key Encryption with Keyword Search of Boneh et al. [9].

Conversely, PE and HVE offer a O(n) complexity for the ciphertext but at the cost of generating and storing $n\cdot q$ trapdoors (to handle any possible offset).

We therefore argue that SEST is an interesting middle way which almost provides the best of the previous two types. Its only drawback compared to SSE and to ASE is the size of the public parameters but we believe this is a reasonable price to pay to achieve all the other features.

1.4 Pattern Matching and Privacy

At first sight, the ability to search patterns within a ciphertext may seem harmful to users’ privacy, compared to standard end-to-end encryption. However, we stress that it is a lesser evil in many use-cases.

For example, in current solutions for DPI [25], the middlebox acts as a man in the middle to decrypt all traffic, which means that end-to-end encryption is gone anyway. Using SEST, the users can at least control which information can be leaked from their traffic since they are the only ones who can issue trapdoors. In particular, they can check that the keywords submitted by the middlebox are legitimate. For example, as we describe in Sect. 6.2, they could agree to issue trapdoors only for patterns associated to malwares, using public rules such as the ones provided by SNORT [1].

More generally, the incompatibility of standard encryption with any data processing often jeopardizes users’ privacy since it gives no other choice than complete decryption of the traffic. We therefore argue that SEST is far from being a threat to privacy and can actually be used to improve it.

Outline. Our paper has the following structure. We begin in Sect. 2 by formally defining our new primitive, Searchable Encryption with Shiftable Trapdoors (SEST). Then, in Sect. 3, we describe an instantiation of this primitive, which relies on public-key encryption and bilinear pairings. In Sect. 4, we describe under which assumptions our scheme achieves provable security, and provide a security proof. We then describe how our construction can be used to handle regular expressions (wildcards and value intervals) in Sect. 5. Handling regular expressions is important in real-world applications, including DPI. In Sect. 6 we discuss the efficiency of our protocol and provide implementation results for pattern matching of all the SNORT rules on encrypted data of various sizes. Finally, we discuss our results and make some concluding remarks in Sect. 7.

2 Searchable Encryption with Shiftable Trapdoors

We begin by presenting the syntax of our SEST primitive. Note that in addition to indicating whether the keyword was found in the (encrypted) plaintext, this scheme also outputs the position(s) at which the keyword is found. This is one advantage of shiftable trapdoors^{Footnote 2}, namely yielding the exact position, within the target plaintext, of the search word. Such knowledge is indeed necessary for some use-cases (see Sect. 6.2).

To keep our model as general as possible we consider strings $S = s_0\ldots s_{m-1}$ whose characters $s_i$ belong to a finite set $\mathcal {S}$. Since $\mathcal {S}$ is finite, we may assume that each of its elements s can be simply indexed by a unique integer f(s) between 0 and $|\mathcal {S}|-1$. For the sake of simplicity, we omit the function f in the following and directly use s as an index (for example T[f(s)] will be denoted by T[s]).

2.1 Syntax

A searchable encryption scheme with shiftable trapdoors is defined by 5 algorithms that we call $\mathtt {Setup}$, $\mathtt {Keygen}$, $\mathtt {Issue}$, $\mathtt {Encrypt}$ and $\mathtt {Test}$. The first three of these are run by an entity called the receiver, while $\mathtt {Encrypt}$ is run by a sender and $\mathtt {Test}$ by a gateway.

$\mathtt {Setup}(1^k,n)$: This probabilistic algorithm takes as input a security parameter k and an integer n defining the maximum size of the strings that one can encrypt. It returns the public parameters pp that will be taken as input by all the other algorithms. In the following, pp will be considered as an implicit input to all algorithms and so will be omitted.
$\mathtt {Keygen}(\mathcal {S})$: This probabilistic algorithm run by the receiver takes as input a finite set $\mathcal {S}$ and returns a key pair $(\mathsf {sk},\mathsf {pk})$. The former value is secret and only known to the receiver, while the latter is public.
$\mathtt {Issue}(W,\mathsf {sk})$: This probabilistic algorithm takes as input a string W of any size $0<\ell \le n$, along with the receiver’s secret key, and returns a trapdoor $\mathsf {td}_W$.
$\mathtt {Encrypt}(S,\mathsf {pk})$: This probabilistic algorithm takes as input the receiver’s public key along with a string $S=s_0\ldots s_{m-1}$ of size $0<m\le n$ such that $s_i\in \mathcal {S}$ for all $i\in [0,m-1]$ and returns a ciphertext C.
$\mathtt {Test}(C,\mathsf {td}_W)$: This deterministic algorithm takes as input a ciphertext C encrypting a string $S=s_0\ldots s_{m-1}$ of size m along with a trapdoor $\mathsf {td}_W$ for a string $W=w_0\ldots w_{\ell -1}$ of size $\ell $. If $m>n$ or $\ell >m$, then the algorithm returns $\perp $. Else, the algorithm returns a set (potentially empty) $\mathcal {J}\subset \{0, \ldots , m-\ell \}$ of indexes j s.t. $s_{j}\ldots s_{j+\ell -1} =w_0\ldots w_{\ell -1}$.
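The input/output behaviour that $\mathtt {Test}$ is expected to have can be written down as a plaintext reference function. The sketch below works on S and W directly rather than on C and $\mathsf {td}_W$, so it describes only the intended result, not how the scheme computes it; the name `ideal_test` is illustrative and `None` stands in for $\perp $.

```python
from typing import Optional, Set


def ideal_test(s: str, w: str, n: int) -> Optional[Set[int]]:
    """Plaintext reference for the result Test(C, td_W) should return."""
    m, l = len(s), len(w)
    if m > n or l > m:
        return None  # corresponds to the error symbol
    # Every offset j in {0, ..., m - l} at which w occurs in s.
    return {j for j in range(m - l + 1) if s[j:j + l] == w}
```

An empty set means the keyword does not occur, which is distinct from the error case.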

Remark 1

Notice that searchable encryption, e.g., [2, 11], usually does not consider a decryption algorithm which takes as input $\mathsf {sk}$ and a ciphertext C encrypting S and which returns S. Indeed, this functionality can easily be added by also encrypting S under a conventional encryption scheme. Nevertheless, one can note that decryption can be performed by issuing a trapdoor for all characters $s\in \mathcal {S}$ and running the $\mathtt {Test}$ algorithm on C for each of them.
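The decryption strategy of Remark 1 can be sketched with a stand-in for the $\mathtt {Test}$ algorithm. In the sketch, `test_char(ch)` is a hypothetical callable modelling $\mathtt {Test}(C,\mathsf {td}_{ch})$ for a one-character keyword, and the plaintext length m is assumed to be known to the decrypter; the mock below answers from a plaintext string purely for demonstration.

```python
from typing import Callable, Iterable, Set


def decrypt_with_char_trapdoors(test_char: Callable[[str], Set[int]],
                                alphabet: Iterable[str], m: int) -> str:
    """Rebuild the plaintext from the match positions of every single character."""
    recovered = [None] * m
    for ch in alphabet:
        for j in test_char(ch):  # positions at which `ch` occurs
            recovered[j] = ch
    return "".join(recovered)


# Mock oracle: answers as Test would for the hidden string below.
secret = "GATTACA"


def mock_test(ch: str) -> Set[int]:
    return {j for j, c in enumerate(secret) if c == ch}


plaintext = decrypt_with_char_trapdoors(mock_test, "ACGT", len(secret))
```

The cost is one trapdoor and one run of the test per character of the alphabet.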

2.2 Security Model

Correctness. As in [2], we divide correctness into two parts. The first one stipulates that the $\mathtt {Test}$ algorithm run on $(C,\mathsf {td}_W)$ will always return j if S contains the substring W at index j (no false negatives). More formally, this means that, for any string S of size $m\le n$ and any W of length $\ell \le m$: whenever $s_{j}\ldots s_{j+\ell -1} =w_0\ldots w_{\ell -1} $,

$$\mathtt {Pr}[j\in \mathtt {Test}(\mathtt {Encrypt}(S,\mathsf {pk}),\mathtt {Issue}(W,\mathsf {sk}))] = 1,$$

where the probability is taken over the choice of the pair $(\mathsf {sk},\mathsf {pk})$.

The second part of the correctness property requires that false positives (i.e., when the $\mathtt {Test}$ algorithm returns j despite the fact $s_{j}\ldots s_{j+\ell -1} \ne w_0\ldots w_{\ell -1}$) only occur with negligible probability. More formally, this means that, for any string S of size $m\le n$ and any string W of length $\ell \le m$:

$$ \mathtt {Pr}\bigg [ \begin{array}{c} j\in \mathtt {Test}(\mathtt {Encrypt}(S,\mathsf {pk}),\mathtt {Issue}(W,\mathsf {sk})) \\ \& \ \ s_{j}\ldots s_{j+\ell -1} \ne w_0\ldots w_{\ell -1} \end{array} \bigg ]\le \mu (k)$$

where $\mu $ is a negligible function.

Indistinguishability (SEST-IND-CPA). For the security requirement of Searchable Encryption with Shiftable Trapdoors (SEST), we adapt the standard notion of IND-CPA to this case (hence the name SEST-IND-CPA). Informally, this notion requires that no adversary $\mathcal {A}$, even with access to an oracle $\mathcal {O}\mathtt {Issue}$ which returns a trapdoor $\mathsf {td}_W$ for any queried string W, can decide whether a ciphertext C encrypts $S_0$ or $S_1$ as long as the trapdoors issued by the oracle do not allow trivial distinction of these two strings. This is formally defined by the experiment $\mathtt {Exp}_\mathcal {A}^{ind-cpa-\beta }(1^k,n)$, where $\beta \in \{0,1\}$ as described in Fig. 2. The set $\mathcal {W}$ is the set of all the strings W submitted to $\mathcal {O}\mathtt {Issue}$.

We define the advantage of such an adversary as $\mathtt {Adv}^{ind-cpa}_\mathcal {A}(1^k,n)= |\Pr [\mathtt {Exp}_\mathcal {A}^{ind-cpa-1}(1^k,n)]\ -\ \Pr [\mathtt {Exp}_\mathcal {A}^{ind-cpa-0}(1^k,n)]|$. A searchable encryption scheme with shiftable trapdoors is SEST-IND-CPA secure if this advantage is negligible for any polynomial-time adversary.

We note that this security notion is very similar to the attribute hiding property of predicate encryption [27]. However, we cannot directly use this latter property because of the differences between predicate encryption and our primitive (e.g., the lack of decryption algorithm), hence the need for a new security game.

The restriction in step 6 simply ensures that if $S_i$ contains $W\in \mathcal {W}$ at offset j, then this is also the case for $S_{1-i}$. Otherwise, running the $\mathtt {Test}$ algorithm on $(C,\mathsf {td}_W)$ would enable $\mathcal {A}$ to trivially win this experiment.

Although this kind of restriction is very common in predicate/functional encryption schemes (e.g. [27]), we stress that, in practice, one must take care that it does not lead to situations where security becomes meaningless. For example, if the adversary gets a trapdoor for every character $s\in \mathcal {S}$, then it will always fail the experiment (it will not be able to output two strings $S_0$ and $S_1$ complying with the requirement of step 6) while being able to decrypt any ciphertext (see Remark 1).

This example highlights the implicit restrictions placed on the set of trapdoors. This is obviously a limitation of the security model (that also applies to all predicate or searchable encryption schemes) but we believe that these restrictions are very hard to formalize and should rather be considered on a case-by-case basis. For example, in the context of DPI, the receiver could assess once and for all the set of rules to check that the leakage remains reasonable.
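The restriction of step 6, as paraphrased above, can be checked on plaintext strings with a few lines of code. The sketch models only the condition stated in the text (each queried keyword occurs at the same offsets in both challenge strings); any further condition of the experiment in Fig. 2 is not modelled, and the function names are illustrative.

```python
def occurrences(s: str, w: str) -> set:
    """Offsets at which w occurs in s."""
    return {j for j in range(len(s) - len(w) + 1) if s[j:j + len(w)] == w}


def admissible(s0: str, s1: str, queried) -> bool:
    """True if no queried keyword separates s0 from s1 by its match positions."""
    return all(occurrences(s0, w) == occurrences(s1, w) for w in queried)
```

With a trapdoor queried for every character of the alphabet, no two distinct strings pass the check, which is the degenerate situation described in the example above.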

Selective-Indistinguishability (SEST-sIND-CPA). We also need a weaker security notion in which the adversary commits to $S_0$ and $S_1$ at the beginning of the experiment, before seeing pp and $\mathsf {pk}$. Such a restriction is quite standard and is usually referred to as selective security [15].

Remark 2

We recall that in a public-key setting, it is always possible to recover W from $\mathsf {td}_W$: one simply has to encrypt the $2^{|W|}$ strings of size |W| and then run $\mathtt {Test}(.,\mathsf {td}_W)$ on each resulting ciphertext. The correctness property ensures (with overwhelming probability) that one will always get an empty set, except for the encryption of W.

Therefore, unless we place restrictions on the set of keywords that one can query (in particular on its min-entropy, as in [10]), we cannot achieve relevant privacy notions for the trapdoor $\mathsf {td}_W$ itself. However, this is not a problem for, say, deep-packet inspection, in which many of the keywords can even be public [1].

Finally, we note that one can achieve interesting privacy notions for the trapdoors in the private-key setting (e.g. [13]).

3 Our Construction

We are able to construct our SEST scheme by “projecting” both the keyword and the plaintext onto a multiplicative basis of the type $z^i$ for some secret integer z. We encrypt the plaintext character-by-character, using secret encodings $\alpha _s$ for each $s\in \mathcal {S}$. The latter are also used to generate the trapdoors associated with the keyword. By using a bilinear mapping we are able to shift into the ciphertext and compare a given fragment of suitable length to the trapdoor.

Note that in order to achieve the security notion of SEST-(s)IND-CPA, we need to at least guarantee that, given some trapdoors $\mathsf {td}_{W_i}$ for words $W_i$, the adversary is not able to forge a trapdoor for some fresh word $W^*$. By projecting keywords on a polynomial in a secret value z, we ensure that trapdoors on keywords W are essentially non-malleable.

We describe our construction in detail in what follows, prefacing our scheme by a brief introduction to bilinear groups and pairings.

3.1 Bilinear Groups

Bilinear groups are a set of three cyclic groups, $\mathbb {G}_1$, $\mathbb {G}_2$, and $\mathbb {G}_{T}$, of prime order p, along with a bilinear map $e: \mathbb {G}_1 \times \mathbb {G}_2 \rightarrow \mathbb {G}_T$ with the following properties:

1.
for all $g\in \mathbb {G}_1, \widetilde{g}\in \mathbb {G}_2$ and $a,b \in \mathbb {Z}_p$, $e(g^a,\widetilde{g}^b)=e(g,\widetilde{g})^{a\cdot b}$;
2.
for any $g \ne 1_{\mathbb {G}_1}$ and $\widetilde{g}\ne 1_{\mathbb {G}_2}$, $e(g,\widetilde{g}) \ne 1_{\mathbb {G}_T}$;
3.
the map e is efficiently computable.
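These properties can be sanity-checked in a toy model. The sketch below is not a real pairing: as an assumption made purely for illustration, every element of $\mathbb {G}_1$, $\mathbb {G}_2$ or $\mathbb {G}_T$ is represented by its discrete logarithm modulo a small prime `P`, which makes the model trivially insecure. The `Elem` class and the `pairing` function are hypothetical names.

```python
import secrets

P = 2**61 - 1  # toy prime group order (assumption, far too small for real use)

class Elem:
    """Group element base^x, stored in the clear as its exponent x mod P."""
    def __init__(self, group, x):
        self.group, self.x = group, x % P
    def __mul__(self, other):          # group law: exponents add
        assert self.group == other.group
        return Elem(self.group, self.x + other.x)
    def __pow__(self, k):              # exponentiation: the exponent is scaled
        return Elem(self.group, self.x * k)
    def __eq__(self, other):
        return (self.group, self.x) == (other.group, other.x)

def pairing(u, v):
    """Toy e: G1 x G2 -> GT, mapping (g^x, g~^y) to e(g, g~)^(x*y)."""
    assert u.group == "G1" and v.group == "G2"
    return Elem("GT", u.x * v.x)

g, g_tilde = Elem("G1", 1), Elem("G2", 1)   # generators
a, b = secrets.randbelow(P), secrets.randbelow(P)

# Property 1 (bilinearity) and property 2 (non-degeneracy) on the generators.
bilinear = pairing(g ** a, g_tilde ** b) == pairing(g, g_tilde) ** (a * b)
non_degenerate = pairing(g, g_tilde) != Elem("GT", 0)  # Elem("GT", 0) is the identity
```

Real instantiations would rely on a pairing library over type 3 curves; the toy model only serves to check the algebra used in the rest of the section.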

Galbraith, Paterson, and Smart [20] defined three types of pairings: in type 1, $\mathbb {G}_1=\mathbb {G}_2$; in type 2, $\mathbb {G}_1\ne \mathbb {G}_2$ but there exists an efficient homomorphism $\phi : \mathbb {G}_2\rightarrow \mathbb {G}_1$, while no efficient one exists in the other direction; in type 3, $\mathbb {G}_1\ne \mathbb {G}_2$ and no efficiently computable homomorphism exists between $\mathbb {G}_1$ and $\mathbb {G}_2$, in either direction.

The security of our construction holds as long as no efficient homomorphism exists from $\mathbb {G}_1$ to $\mathbb {G}_2$. Our system must therefore be instantiated with pairings of type 2 or 3. However, in the following, we will only consider the latter type since it allows simpler security proofs thanks to the separation between the two groups $\mathbb {G}_1$ and $\mathbb {G}_2$. We stress that this is not a significant restriction since type 3 pairings offer the best performance among the three types.

3.2 Intuition

Intuitively, our scheme associates each element s of $\mathcal {S}$ with a secret encoding $\alpha _s$. A trapdoor for a string $w_0\ldots w_{\ell -1}$ is associated with a polynomial $V = \sum _{i=0}^{\ell -1} v_i \cdot \alpha _{w_i} \cdot z^i$ where $v_i$ are random secret scalars whose purpose is to prevent forgeries of new trapdoors. The trapdoor then consists of the elements $\widetilde{g}^V$ and $\widetilde{g}^{v_i}$ for $i=0,\ldots , \ell -1$. Meanwhile, a ciphertext encrypting a string $s_0\ldots s_{n-1}$ is the sequence of “monomials” $C'_j = g^{a\cdot \alpha _{s_j}\cdot z^j}$ where a is a random factor (the Keygen algorithm will ensure that this can be done by only using elements from the public key). By using the bilinear map e, one can derive from the ciphertext and the trapdoor elements of the form $e(g,\widetilde{g})^{U}$ where U is a polynomial whose coefficients depend on the encodings $\alpha _{s_i}$ and on the scalars $v_i$.

In this encoding, if $s_0\ldots s_{n-1}$ contains the pattern $w_0\ldots w_{\ell -1}$ at offset j (i.e. if $s_{j+i} = w_{i}$ for $i=0,\ldots ,\ell -1$) one can generate $e(g,\widetilde{g})^{U} = \prod _{i=0}^{\ell -1} e(C'_{j+i},\widetilde{g}^{v_i})$ where $U = a\cdot z^j \cdot V$. Therefore, by extending the ciphertext with the elements $C_j=g^{a\cdot z^j}$, one can simply test the presence of W. By contrast, a difference $s_{j+i}\ne w_i$ or the combination of non-successive ciphertext elements will lead to a random-looking polynomial which would be useless to the adversary.

However, using this solution to search for a pattern of length $\ell $ within a string of length m requires $(\ell +1)(m-\ell +1)$ pairings, which quickly becomes prohibitive. While it seems natural that the complexity depends on the size m (since we have to search at every position), one could hope to reduce the factor $(\ell +1)$.

A first attempt could be to set $v_i=v$ for all $i\in [0,\ell -1]$ for some secret scalar v. Indeed, thanks to the bilinearity of e, the $\ell $ pairings $\prod _{i=0}^{\ell -1} e(C'_{j+i},\widetilde{g}^{v_i})$ could be replaced by only one: $e(\prod _{i=0}^{\ell -1}C'_{j+i},\widetilde{g}^{v})$. Unfortunately, such a solution is insecure as proven by the following example.

Let C be a ciphertext encrypting a string $S=s_0\ldots s_{m-1}$ and let us assume that W is a keyword such that $w_i=s$ for all $i\in [0,\ell -1]$ (i.e. W is a sequence of identical values, equal to s). Then, for any $0<j\le \ell -1$

$$e(C_0\cdot C_{j}^{-1},\widetilde{g}^{V_W}) = e(g,\widetilde{g})^{a(1-z^{j})V_W} =e(g,\widetilde{g})^{aV'},$$

with

$$ V' = \sum _{k=0}^{j-1} v\cdot \alpha _s \cdot z^k -\sum _{k=\ell }^{\ell +j -1} v\cdot \alpha _s \cdot z^k. $$

Therefore, $e(g,\widetilde{g})^{aV'}$ can be used to check whether

$$s_0\ldots s_{j-1} = \overbrace{s\ldots s}^{j\; \text {times} }\ \wedge \ s_{\ell }\ldots s_{\ell +j -1} = \overbrace{s\ldots s}^{j\; \text {times}}.$$

Using $\mathsf {td}_W$, a gateway is then able to get more information on S than the presence of W as a substring, which breaks the security of the construction.
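The attack can be replayed in a toy model. The sketch below rests on assumptions made only for illustration: group elements are stored as their exponents modulo a small prime `P` (so nothing here is secure), the pairing is modelled as multiplication of exponents, and the function names are hypothetical. It issues the insecure trapdoor with a single scalar v for a constant keyword and evaluates the check derived from $e(g,\widetilde{g})^{aV'}$.

```python
import secrets

P = 2**61 - 1  # toy prime group order; elements are stored as exponents (NOT secure)

def rand_scalar():
    return secrets.randbelow(P - 1) + 1

z = rand_scalar()
alpha = {s: rand_scalar() for s in "ab"}

def encrypt(S):
    # C_i = g^{a z^i}, C'_i = g^{a alpha_{s_i} z^i}, written as exponents
    a = rand_scalar()
    return [(a * pow(z, i, P) % P, a * alpha[s] * pow(z, i, P) % P)
            for i, s in enumerate(S)]

def insecure_issue(s, ell):
    # Same scalar v at every position of the constant keyword s...s
    v = rand_scalar()
    V = sum(v * alpha[s] * pow(z, k, P) for k in range(ell)) % P
    return v, V

def leak_check(C, td, ell, j):
    """True iff s_0..s_{j-1} and s_ell..s_{ell+j-1} all equal the keyword character."""
    v, V = td
    lhs = (C[0][0] - C[j][0]) * V % P  # e(C_0 * C_j^{-1}, g~^V)
    head = sum(C[k][1] for k in range(j))
    tail = sum(C[k][1] for k in range(ell, ell + j))
    rhs = (head - tail) * v % P        # pairings of C'_k with g~^v
    return lhs == rhs

td = insecure_issue("a", 4)
```

For instance, the string `aabbaa` does not contain `aaaa`, yet `leak_check(encrypt("aabbaa"), td, 4, 2)` succeeds and reveals that its first two and last two characters are `a`.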

However, this attack does not mean that we necessarily have to select different scalars $v_i$ but simply that the generation process needs to be more subtle. We indeed prove that one can “recycle” the random elements $v_i$ within the same trapdoor without jeopardizing security. More specifically, the issuing process that we describe in the next section is based on the observation that the secret encodings $\alpha _s$ already add some variability to the coefficients of the polynomial V. This therefore means that this variability need not exclusively rely on the random scalars $v_i$. In particular when $w_i\ne w_j$, the coefficients $v_i\cdot \alpha _{w_i}$ and $v_j\cdot \alpha _{w_j}$ will be different even if $v_i= v_j$. In such a case, there is no need to choose distinct scalars, which allows us to batch the corresponding pairings for the test. Compared to the solution with random scalars $v_i$, this divides the whole number of pairings by up to $|\mathcal {S}|$ (e.g., 256 if we consider bytestrings).
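The effect of recycling scalars on the pairing count can be illustrated with a short calculation. The sketch below assumes that a scalar is shared only among positions carrying distinct characters, so that the number c of scalars equals the largest number of occurrences of any single character in W; the function names are hypothetical.

```python
from collections import Counter

def pairings_naive(ell, m):
    """One scalar per position: ell + 1 pairings at each of the m - ell + 1 offsets."""
    return (ell + 1) * (m - ell + 1)

def pairings_batched(W, m):
    """Recycled scalars: c + 1 pairings per offset, c = max multiplicity in W."""
    c = max(Counter(W).values())
    return (c + 1) * (m - len(W) + 1)
```

A keyword with pairwise distinct characters then needs 2 pairings per offset regardless of its length, while a constant keyword gains nothing from batching.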

3.3 The Protocol

$\mathtt {Setup}(1^k,n)$: Given the description $(\mathbb {G}_1,\mathbb {G}_2,\mathbb {G}_T,e)$ of type 3 bilinear groups of prime order p, this algorithm selects $g{\mathop {\leftarrow }\limits ^{{}_{\$}}}\mathbb {G}_1$ and $\widetilde{g}{\mathop {\leftarrow }\limits ^{{}_{\$}}}\mathbb {G}_2$ and returns $pp\leftarrow ( \mathbb {G}_1,\mathbb {G}_2,\mathbb {G}_T,e,g,\widetilde{g},n)$.
$\mathtt {Keygen}(\mathcal {S})$: On input a finite set $\mathcal {S}$, this algorithm selects $|\mathcal {S}| +1$ random scalars $z,\{\alpha _s\}_{s\in \mathcal {S}}$ and computes $g_i\leftarrow g^{z^i}$ along with $\{g_i^{\alpha _s}\}_{s\in \mathcal {S}}$ for $i=0,\ldots , n-1$. The public key $\mathsf {pk}$ is set as $\{(g_i,\{g_i^{\alpha _s}\}_{s\in \mathcal {S}})\}_{i=0}^{n-1}$ whereas $\mathsf {sk}$ is set as $(z,\{\alpha _s\}_{s\in \mathcal {S}})$.
$\mathtt {Encrypt}(S,\mathsf {pk})$: To encrypt a string $S = s_0\ldots s_{m-1}$, where $m\le n$, the user selects a random scalar a and returns $C = \{(C_i,C'_i)\}_{i=0}^{m-1}$, where $C_i \leftarrow g_{i}^a$ and $C'_i \leftarrow g_i^{a\cdot \alpha _{s_i}}$ for $i=0,\ldots ,m-1$.
$\mathtt {Issue}(W,\mathsf {sk})$: To issue a trapdoor $\mathsf {td}_W$ for a string $W = w_0\ldots w_{\ell -1}$ of length $\ell \le n$, one uses the following algorithm.
Our $\mathtt {Issue}$ algorithm formalizes the following principle: the random scalars (stored in L) can be re-used as long as the coefficients of the polynomial V are all distinct. In particular, if we write V as $\sum _{i=0}^{\ell -1} v_i\cdot \alpha _{w_i} \cdot z^i$, then $v_i\ne v_j$ whenever $w_i =w_j$ for $i\ne j$.
$\mathtt {Test}(C,\mathsf {td}_W)$: To test whether the string S encrypted by C contains the substring W, the algorithm parses $\mathsf {td}_W$ as $(c,\{\mathcal {I}_j\}_{j=0}^{c-1},\{\widetilde{g}^{L[j]}\}_{j=0}^{c-1},\widetilde{g}^V)$ and C as $\{(C_i,C'_i)\}_{i=0}^{m-1}$ and checks, for $j=0,\ldots , m - \ell $, if the following equation holds:
$$\begin{aligned} \prod \nolimits _{t=0}^{c-1} e(\prod \nolimits _{i\in \mathcal {I}_t} C'_{j+i},\widetilde{g}^{L[t]}) = e(C_j,\widetilde{g}^V). \end{aligned}$$
It then returns the (potentially empty) set $\mathcal {J}$ of indexes j for which there is a match.

Correctness. First note that, if S contains the substring W at index j (i.e., $s_{j+i} = w_{i}$ $\forall i=0,\ldots ,\ell -1$), then:

$$\begin{aligned} \prod _{t=0}^{c-1} e(\prod _{i\in \mathcal {I}_t} C'_{j+i},\widetilde{g}^{L[t]})&= \prod _{t=0}^{c-1} e(\prod _{i\in \mathcal {I}_t} g^{a\cdot \alpha _{s_{j+i}}\cdot z^{j+i}},\widetilde{g}^{L[t]})\\&= \prod _{t=0}^{c-1} e( g^{a},\widetilde{g}^{L[t]\cdot \sum _{i\in \mathcal {I}_t} \alpha _{w_{i}}\cdot z^{j+i}})\\&= \prod _{t=0}^{c-1} e( g^{a},\widetilde{g}^{\sum _{i\in \mathcal {I}_t} L[t]\cdot \alpha _{w_{i}}\cdot z^{j+i}})\\&= e(g,\widetilde{g})^{a\cdot z^j\cdot V} = e(C_j,\widetilde{g}^V) \\ \end{aligned}$$

The set $\mathcal {J}$ returned by $\mathtt {Test}$ contains j.

Now, let us assume that $\mathcal {J}$ contains j but that $s_{j}\ldots s_{j+\ell -1} \ne w_{0} \ldots w_{\ell -1}$, i.e., the algorithm returns a false positive. Let $\mathcal {I}_{\ne }$ be the (non-empty) set of indexes i such that $s_{j+i} \ne w_i$. For all $i\in [0,\ell -1]$, we define $v_i=L[t_i]$ where $t_i$ is such that $i\in \mathcal {I}_{t_i}$. Since j has been returned by $\mathtt {Test}$, comparing the exponents on both sides of the $\mathtt {Test}$ equation and dividing by the non-zero factor $a\cdot z^j$ gives

$$\sum _{i=0}^{\ell -1} v_i\cdot \alpha _{s_{j+i}}\cdot z^{j+i} = \sum _{i=0}^{\ell -1} v_i\cdot \alpha _{w_i}\cdot z^{j+i}, \quad \text {i.e.,}\quad \sum _{i\in \mathcal {I}_{\ne }} v_i\cdot (\alpha _{s_{j+i}}-\alpha _{w_i})\cdot z^{i} = 0.$$

Since $\alpha _{s_{j+i}}\ne \alpha _{w_i}$ for all $i\in \mathcal {I}_{\ne }$, this amounts to evaluating the probability that a random scalar z is a root of a non-zero polynomial of degree at most $\ell -1$. The probability that $\mathtt {Test}$ returns a false positive j is thus at most $\frac{\ell -1}{p}$, which is negligible.

Remark 3

Our construction achieves the goals defined at the beginning of Sect. 1.1. Indeed, the $\mathtt {Encrypt}$ procedure does not depend on the keywords W, and the latter may have distinct lengths. In particular, the size of C only depends on the length of the message it encrypts. Moreover, the trapdoors $\mathsf {td}_W$ allow searching for the word W in $S=s_0\ldots s_{m-1}$ at any possible offset, while being of size independent of m.

All these features are provided using only asymmetric prime order bilinear groups, which can be very efficiently implemented on a computer (e.g., [7]). We refer to Sect. 6 for a more thorough analysis of the efficiency of our protocol.

Remark 4

As explained in Sect. 2.1, public-key searchable encryption schemes often assume that the sender will also encrypt the string S by using a conventional encryption scheme $\varPi $. Such a solution enables fast decryption but should be used cautiously in some contexts, such as DPI, where the sender is likely to be malicious. Indeed, nothing prevents the latter from encrypting a harmless string S using the searchable encryption scheme while encrypting a different $S'$ using $\varPi $. The message (S) checked by the gateway would then be different from the one forwarded to the receiver ($S'$), which would make the inspection pointless.

It is therefore necessary to check that both ciphertexts decrypt to the same string S, which can easily be done by the receiver. Indeed, after decrypting the conventional ciphertext, the latter (who knows $\mathsf {sk}$) can verify whether $\{(C_i,C'_i)\}_{i=0}^{m-1}$ encrypts $S=s_0\ldots s_{m-1}$ by testing if $C'_i = C_i^{\alpha _{s_i}}$ for $i\in [0,m-1]$. One can also perform such tests only for a limited number $N\le m$ of randomly chosen indexes i, but the probability of detecting a cheating sender who altered a single character then drops to $\frac{N}{m}$.
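The receiver's consistency check can be sketched as follows. This is a toy illustration under stated assumptions: group elements are stored as their exponents modulo a small prime `P` (not secure), so $C_i^{\alpha _{s_i}}$ becomes a modular multiplication, and the function names and the optional `indexes` parameter are hypothetical.

```python
import secrets

P = 2**61 - 1  # toy prime group order; elements are stored as exponents (NOT secure)

def rand_scalar():
    return secrets.randbelow(P - 1) + 1

def keygen(alphabet, n):
    z, alpha = rand_scalar(), {s: rand_scalar() for s in alphabet}
    pk = [(pow(z, i, P), {s: alpha[s] * pow(z, i, P) % P for s in alphabet})
          for i in range(n)]
    return pk, (z, alpha)

def encrypt(S, pk):
    a = rand_scalar()
    return [(a * pk[i][0] % P, a * pk[i][1][s] % P) for i, s in enumerate(S)]

def receiver_check(C, S, sk, indexes=None):
    """Check C'_i == C_i^{alpha_{s_i}} on the given indexes (all of them by default)."""
    _, alpha = sk
    if len(C) != len(S):
        return False
    if indexes is None:
        indexes = range(len(S))
    return all(C[i][1] == C[i][0] * alpha[S[i]] % P for i in indexes)
```

Passing a subset of positions through `indexes` models the spot check on N indexes: a mismatch located outside that subset goes unnoticed.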

4 Security Analysis

4.1 Complexity Assumptions

Let us consider an adversary $\mathcal {A}$ which, knowing q trapdoors $\mathsf {td}_{W_k}$, would like to decide if a ciphertext C encrypts $S_0$ or $S_1$. The natural restrictions imposed by the security model imply that there is at least one index $i^*$ such that $s_{i^*}^{(0)}\ne s^{(1)}_{i^*}$ and that, for all $k\in [1,q]$ and all $j\in [0,\ell _k-1]$ (where $\ell _k$ is the length of $W_k$), $ s_{i^*-\ell _k+1+j}^{(0)}\ldots s_{i^*+j}^{(0)}$ and $ s^{(1)}_{i^*-\ell _k+1+j}\ldots s^{(1)}_{i^*+j}$ both differ from $w_{k,0},\ldots , w_{k,\ell _k-1}$. In other words, any substring of $S_0$ (or respectively $S_1$) of length $\ell _k$ containing $s_{i^*}^{(0)}$ (resp. $s^{(1)}_{i^*}$) must be different from $W_k$, for all $k\in [1,q]$.

If we focus on the index $i^*$, $\mathcal {A}$ must then distinguish whether the discrete logarithm of $C'_{i^*}$ in base $g_{i^*}$ is $a\cdot \alpha _{s_{i^*}^{(0)}}$ or $a\cdot \alpha _{s_{i^*}^{(1)}}$. To this end, the attacker has access to many elements of $\mathbb {G}_1$ (the public parameters and the other elements of the ciphertext) and of $\mathbb {G}_2$ (the trapdoors $\mathsf {td}_{W_k}$). All of them are of the form $g^{P_u(a,\alpha _s,z)}$ or $\widetilde{g}^{Q_v(\alpha _s,z,v_{i,k})}$ for a polynomial number of multivariate polynomials $P_u$ and $Q_v$. The assumption underlying the security of our scheme is thus related to the General Diffie-Hellman ($\mathsf {GDH}$) problem [8], whose asymmetric version [12] is recalled below.

Definition 1

($\mathsf {GDH}$ assumption). Let r, s, t and c be four positive integers and $\mathtt {R}\in \mathbb {F}_p[X_1,\ldots ,X_c]^r$, $\mathtt {S}\in \mathbb {F}_p[X_1,\ldots ,X_c]^s$, and $\mathtt {T}\in \mathbb {F}_p[X_1,\ldots ,X_c]^t$ be three tuples of multivariate polynomials over $\mathbb {F}_p$. Let $R^{(i)}, S^{(i)}$ and $T^{(i)}$ denote the i-th polynomial contained in $\mathtt {R}$, $\mathtt {S}$, and $\mathtt {T}$. For any polynomial $f\in \mathbb {F}_p[X_1,\ldots ,X_c]$, we say that f is dependent on ${<}\mathtt {R},\mathtt {S},\mathtt {T}{>}$ if there are $\{a_j\}_{j=1}^s\in \mathbb {F}_p^s\setminus \{(0,\ldots ,0)\}$, $\{b_{i,j}\}_{i,j=1}^{i=r,j=s}\in \mathbb {F}_p^{r\cdot s}$ and $\{c_k\}_{k=1}^{t}\in \mathbb {F}_p^t$ such that

$$f(\sum _{j} a_j S^{(j)}) = \sum _{i,j} b_{i,j}R^{(i)}S^{(j)} + \sum _k c_k T^{(k)}.$$

Let $(x_1,\ldots ,x_c)$ be a secret vector. The $\mathsf {GDH}$ assumption states that, given the values $\{g^{R^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^r$, $\{\widetilde{g}^{S^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^s$ and $\{e(g,\widetilde{g})^{T^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^t$, it is hard to decide whether $U = g^{f(x_1,\ldots ,x_c)}$ or U is random if f is independent of ${<}\mathtt {R},\mathtt {S},\mathtt {T}{>}$.

Unfortunately, we cannot directly make use of this assumption unless we severely restrict the size n of the strings that one can encrypt. In our proof, presented in Sect. 4.2, one of the main steps is showing that, even given a number of keyword trapdoors (and in particular, the polynomials V associated with those keywords), the adversary is unable to detect the presence of a fresh keyword; consequently, we can bound the leakage on the input plaintexts by only considering the adversary’s queries to the issuing oracle. This can be mapped to an instance of $\mathsf {GDH}$, but we will need the adversary to choose which of those polynomials are input to the $\mathsf {GDH}$ instance.

If we did bound the size n of the plaintext, by making a guess on the string $S_\beta =s^{(\beta )}_{1}\ldots s^{(\beta )}_{m}$, one could define a $\mathsf {GDH}$ instance providing all the elements of the public parameters, the trapdoors for every word W that does not match any of the substrings of $S_\beta $ containing $s_{i^*}^{(\beta )}$, the elements $\{g_i^a\}_{i=0}^{n-1}$ and $\{g_{i}^{a\cdot \alpha _{s_i}}\}_{i\in [0,n-1]\setminus \{i^*\}}$ along with the challenge element $U\in \mathbb {G}_1$ associated with the polynomial $f=a\cdot z^{i^*}\cdot \alpha _{s_{i^*}}$.

With such a $\mathsf {GDH}$ instance, the security proof becomes straightforward and only requires a proof that f does not depend on the polynomials underlying the provided elements. However, the reduction does not abort only if the initial guess is valid, which occurs with probability $\frac{1}{2^{n}}$.

So either we require n to be small (say $n \le 30$, for example) or we choose to rely on an interactive variant of the $\mathsf {GDH}$ assumption, in which the elements $g^{R^{(i)}(x_1,\ldots ,x_c)}$, $\widetilde{g}^{S^{(i)}(x_1,\ldots ,x_c)}$ and $e(g,\widetilde{g})^{T^{(i)}(x_1,\ldots ,x_c)}$ can be queried to specific oracles, to offer enough flexibility to the simulator.

The latter solution is less than ideal because it essentially makes the $\mathsf {GDH}$ instance interactive, and consequently our construction ends up with weaker security guarantees than it would have under a static assumption. Nevertheless, we argue that this solution remains of interest for two reasons. The first is that it allows us to construct a quite efficient scheme with remarkable features: the size of the ciphertext is independent of the sizes of the searchable strings, and the size of the trapdoors is independent of the size of the messages. Achieving this while being able to handle any trapdoor query is not obvious and may justify the use of an interactive assumption.

A second reason is that, intrinsically, the hardness of the $\mathsf {GDH}$ problem (proven in the generic group model [8]) relies on the same argument as its interactive variant: as long as the “challenge” polynomial f does not depend on ${<}\mathtt {R},\mathtt {S},\mathtt {T}{>}$, $g^{f(x_1,\ldots ,x_c)}$ is indistinguishable from a random element of $\mathbb {G}_1$. The fact that the sets $\mathtt {R}$, $\mathtt {S}$, and $\mathtt {T}$ are defined in the assumption or by the queries to oracles does not fundamentally impact the proof. We therefore define the interactive-$\mathsf {GDH}$ (i-$\mathsf {GDH}$) assumption and show that our scheme can be proven secure under it.

Definition 2

(i-$\mathsf {GDH}$ assumption). Let r, s, t, c, and k be five positive integers and $\mathtt {R}\in \mathbb {F}_p[X_1,\ldots ,X_c]^r$, $\mathtt {S}\in \mathbb {F}_p[X_1,\ldots ,X_c]^s$ and $\mathtt {T}\in \mathbb {F}_p[X_1,\ldots ,X_c]^t$ be three tuples of multivariate polynomials over $\mathbb {F}_p$. Let $\mathcal {O}^{\mathtt {R}}$ (resp. $\mathcal {O}^{\mathtt {S}}$ and $\mathcal {O}^{\mathtt {T}}$) be oracles that, on input $\{\{a_{i_1,\ldots , i_c}^{(k)}\}_{i_j=0}^{d_k}\}_k$, add the polynomials $\{\sum \limits _{i_1,\ldots ,i_c} a_{i_1,\ldots , i_c}^{(k)} \prod \limits _j X_j^{i_j}\}_k$ to $\mathtt {R}$ (resp. $\mathtt {S}$ and $\mathtt {T}$).

Let $(x_1,\ldots ,x_c)$ be a secret vector and $q_{\mathtt {R}}$ (resp. $q_{\mathtt {S}}$ and $q_{\mathtt {T}}$) be the number of queries to $\mathcal {O}^{\mathtt {R}}$ (resp. $\mathcal {O}^{\mathtt {S}}$ and $\mathcal {O}^{\mathtt {T}}$). The i-$\mathsf {GDH}$ assumption states that, given the values $\{g^{R^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^{r+k\cdot q_{\mathtt {R}}}$, $\{\widetilde{g}^{S^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^{s+k\cdot q_{\mathtt {S}}}$ and $\{e(g,\widetilde{g})^{T^{(i)}(x_1,\ldots ,x_c)}\}_{i=1}^{t+k\cdot q_{\mathtt {T}}}$, it is hard to decide whether $U = g^{f(x_1,\ldots ,x_c)}$ or U is random if f is independent of ${<}\mathtt {R},\mathtt {S},\mathtt {T}{>}$.

4.2 Security Results

Theorem 3

The scheme described in Sect. 3 is SEST-sIND-CPA secure under the i-$\mathsf {GDH}$ assumption for $\mathtt {R}$, $\mathtt {S}$, and $\mathtt {T}$ initially set as $\mathtt {R}=\{(z^i,x_j\cdot z^i, a\cdot z^i)\}_{i=0,j=0}^{i=2n-1,j=|\mathcal {S}|-1}$, $\mathtt {S}=\mathtt {T}=\emptyset $ and $f=a\cdot x_0\cdot z^n$.

Proof

Let $G_0^{(\beta )}$ denote the $\mathtt {Exp}_\mathcal {A}^{sind-cpa-\beta }$ game, as described in Sect. 2.2 (recall that this is the selective version of the IND-CPA security notion). Moreover, let $S_0=s_0^{(0)}\ldots s_{m-1}^{(0)}$ and $S_1=s_0^{(1)}\ldots s_{m-1}^{(1)}$ be the two strings returned by $\mathcal {A}$ at the beginning of the game. Our proof uses a sequence of games $G_j^{(\beta )}$, for $j=1,\ldots ,n$, to argue that the advantage of $\mathcal {A}$ is negligible. This is a standard hybrid argument, in which at each game hop we randomize another element of the challenge ciphertext.

Let $\mathcal {I}_{\ne }$ be the set of indexes i such that $s_i^{(0)}\ne s_i^{(1)}$ and $\mathcal {I}_{\ne }^{(j)}$ be the subset containing the first j indexes of $\mathcal {I}_{\ne }$ (if $j>|\mathcal {I}_{\ne }|$, then $\mathcal {I}_{\ne }^{(j)}=\mathcal {I}_{\ne }$). For $j=1,\ldots ,n$, game $G_j^{(\beta )}$ modifies $G_0^{(\beta )}$ by switching the elements $C'_i$ of the challenge ciphertext to random elements of $\mathbb {G}_1$, for $i\in \mathcal {I}_{\ne }^{(j)}$. Ultimately, in the last game, $G_n^{(\beta )}$, the challenge ciphertext contains no meaningful information about $s_i^{(\beta )}$ $\forall i\in \mathcal {I}_{\ne }$, so the adversary cannot distinguish whether it plays $G_n^{(0)}$ or $G_n^{(1)}$.

In particular, we can write:

$$ \begin{array}{l} \mathtt {Adv}^{sind-cpa}_\mathcal {A}(1^k,n) \\ = |\Pr [\mathtt {Exp}_\mathcal {A}^{sind-cpa-1}(1^k,n)]-\Pr [\mathtt {Exp}_\mathcal {A}^{sind-cpa-0}(1^k,n)]|\\ = |G_0^{(1)}(1^k,n) - G_0^{(0)}(1^k,n)|\\ \le \sum \nolimits _{j=0}^{n-1} |G_{j}^{(1)}(1^k,n)- G_{j+1}^{(1)}(1^k,n)| \\ \quad + |G_{n}^{(1)}(1^k,n)- G_{n}^{(0)}(1^k,n)| \\ \quad + \sum \nolimits _{j=0}^{n-1} |G_{j+1}^{(0)}(1^k,n)- G_{j}^{(0)}(1^k,n)|\\ \le \sum \nolimits _{j=0}^{n-1} |G_{j}^{(1)}(1^k,n)- G_{j+1}^{(1)}(1^k,n)| \\ \quad + \sum \nolimits _{j=0}^{n-1} |G_{j+1}^{(0)}(1^k,n)- G_{j}^{(0)}(1^k,n)|. \end{array} $$

In order to bound this result, we must prove that $\mathcal {A}$ cannot distinguish $G_j^{(\beta )}$ from $G_{j+1}^{(\beta )}$, which is formally stated by the lemma below.

Assuming that this lemma holds, each term above is negligible under the i-$\mathsf {GDH}$ assumption, which concludes the proof.

Lemma 4

For all $j=0,\ldots ,n-1$ and $\beta \in \{0,1\}$, the difference $|\mathtt {Pr}[G_j^{(\beta )}(1^k,n)=1] - \mathtt {Pr}[G_{j+1}^{(\beta )}(1^k,n)=1]|$ is negligible under the i-$\mathsf {GDH}$ assumption for $\mathtt {R}$, $\mathtt {S}$, and $\mathtt {T}$ initially set as follows: $\mathtt {R}=\{(z^i,x_j\cdot z^i, a\cdot z^i)\}_{i=0,j=0}^{i=2n-1,j=|\mathcal {S}|-1}$, $\mathtt {S}=\mathtt {T}=\emptyset $ and $f=a\cdot x_0\cdot z^n$.

The proof is provided in the full version [19].

5 Handling Regular Expressions

Our solution, introduced in Sect. 3, allows for pattern matching of keywords of arbitrary lengths, for ciphertexts emitted from arbitrary sources (we call this having universal tokens). In this section, we extend our notion of keyword-search to a more generic case, in which some of the keyword characters are fully-unknown (wildcards) and some are only partially-unknown (in an interval of size greater than 1).

Consider the general case in which one wants to search for substrings of the form $W=w_0\ldots w_{t-1} w_t^{(\mathcal {S}_t)} w_{t+1} \ldots w_{\ell -1}$ where $w_t^{(\mathcal {S}_t)}$ denotes any element from the set $\mathcal {S}_t\subset \mathcal {S}$. For example, $\mathcal {S}_t$ can be the set [0-9] of all integers between 0 and 9.

A trivial solution could be to issue a trapdoor for every possible value of $w_t$, but this would require the gateway to store the $|\mathcal {S}_t| $ resulting trapdoors and to test each of them separately. This not only raises a question of efficiency, but it also gives the gateway much more information on the input string. Intuitively, at the end of the search, the gateway will not only be able to tell that a given character is within a certain subset, but also which particular element of the subset it corresponds to.

In the following, we show how to modify our construction to allow for two notable regular expressions: wildcards and interval searches, without leaking any additional information, and with a minimal efficiency loss.

5.1 Handling Wildcards

The first case we consider assumes $W= w_0\ldots w_{i_1}^{(\mathcal {S}_{i_1})}\ldots w_{i_r}^{(\mathcal {S}_{i_r})}\ldots w_{\ell -1}$ with $\mathcal {S}_{i_1}=...=\mathcal {S}_{i_r}=\mathcal {S}$, which means that $w_{i_1}^{(\mathcal {S}_{i_1})},\ldots ,w_{i_r}^{(\mathcal {S}_{i_r})}$ can take any value from the set $\mathcal {S}$ and can consequently be seen as “wildcards”.

Informally, this implies that the $(j+i_{1})$-th,...,$(j+i_{r})$-th ciphertext elements must not be taken into account when testing if $C_{j}\ldots C_{j+\ell -1}$ encrypts W. This leads to the following variant of our main protocol where only the $\mathtt {Issue}$ and the $\mathtt {Test}$ algorithms differ (slightly) from the original ones.

$\mathtt {Issue}(W,\mathsf {sk})$: Let $\mathcal {D}= \{i_1,\ldots ,i_r\}$. The issuance process of a trapdoor $\mathsf {td}_W$ for $W = w_0\ldots w_{i_1}^{(\mathcal {S}_{i_1})}\ldots w_{i_r}^{(\mathcal {S}_{i_r})}\ldots w_{\ell -1}$ is described by Algorithm 2.

The only difference with the original $\mathtt {Issue}$ algorithm is the additional condition $i\notin \mathcal {D}$ which ensures that V will have no monomial of degree i for $i\in \mathcal {D}$.
$\mathtt {Test}(C,\mathsf {td}_W)$: this algorithm remains unchanged except that the trapdoor now contains the set $\mathcal {D}$. The process still consists of checking if the equality
$$(1)\quad \prod _{t=0}^{c-1} e(\prod _{i\in \mathcal {I}_t} C'_{j+i},\widetilde{g}^{L[t]}) = e(C_j,\widetilde{g}^V). $$
holds for $j=0,\ldots , m - \ell $.

One can note that this variant does not increase the complexity of our scheme. In fact, the opposite holds: all the indexes in $\mathcal {D}$ are discarded in the product of (1). Regarding security, one can note that the proof of Sect. 4 still applies here, since the latter does not require the coefficients $v_i$ to be different from 0.

5.2 Handling General Subsets

Now let us consider the general case where the substring W one wants to search contains $w_{i}^{(\mathcal {S}_{i})}$ for a subset $\mathcal {S}_i\subsetneq \mathcal {S}$. For example, $\mathcal {S}_i$ can be the set $[0,9]$ of all integers between 0 and 9 or the set $ \{a,\ldots ,z\}$ of the letters of the Latin alphabet. Our construction can actually be modified to handle this kind of search provided that: (1) the searchable sets $\mathcal {S}_{i}$ are known in advance, and can be used during the $\mathtt {Keygen}$ process; and (2) all these subsets are disjoint. We argue that both conditions are reasonable since this is often the case for regular expressions.

5.3 The Protocol

$\mathtt {Setup}(1^k,n)$: Given the description $(\mathbb {G}_1,\mathbb {G}_2,\mathbb {G}_T,e)$ of type 3 bilinear groups of prime order p, this algorithm selects $g{\mathop {\leftarrow }\limits ^{{}_{\$}}}\mathbb {G}_1$ and $\widetilde{g}{\mathop {\leftarrow }\limits ^{{}_{\$}}}\mathbb {G}_2$ and returns $pp\leftarrow ( \mathbb {G}_1,\mathbb {G}_2,\mathbb {G}_T,e,g,\widetilde{g},n)$.
$\mathtt {Keygen}(\mathcal {S},\mathcal {S}^{(1)},\ldots ,\mathcal {S}^{(k)})$: This algorithm now takes as input k disjoint subsets of $\mathcal {S}$. We can assume, without loss of generality, that $\mathcal {S}= \mathcal {S}^{(1)}\cup \ldots \cup \mathcal {S}^{(k)}$ since we can simply add the complement of all previous sets if this is not the case. The function $f:\mathcal {S}\rightarrow \{1,\ldots ,k\}$ which maps any element $s\in \mathcal {S}$ to the index of the set $\mathcal {S}^{(j)}$ which contains it is thus well defined. The algorithm then selects $|\mathcal {S}|+k+1$ random scalars $\{\alpha _s\}_{s\in \mathcal {S}},\beta _1,\ldots ,\beta _k,z{\mathop {\leftarrow }\limits ^{{}_{\$}}}\mathbb {Z}_p$ and computes $g_i\leftarrow g^{z^i}$ for $i=0,\ldots , n-1$ along with $(g_i^{\alpha _s},g_i^{\beta _d})$ for $d= 1,\ldots ,k$ and all $s\in \mathcal {S}^{(d)}$. The public key is then set to $\{g_i\}_{i=0}^{n-1} \cup _{d=1}^k \{(g_i^{\alpha _s},g_i^{\beta _d})\}_{i\in [0,n-1], s\in \mathcal {S}^{(d)}}$ and $\mathsf {sk}$ as $\{\alpha _s\}_{s\in \mathcal {S}},\beta _1,\ldots ,\beta _k,z$.
$\mathtt {Encrypt}(S,\mathsf {pk})$: To encrypt a string $S = s_0\ldots s_{m-1}$, where $m\le n$, the user selects a random scalar a and returns $C = \{(C_i,C_i^{(1)},C^{(2)}_i)\}_{i=0}^{m-1}$, where $C_i \leftarrow g_{i}^a$, $C^{(1)}_i \leftarrow (g_i^{\alpha _{s_i}})^a$ and $C^{(2)}_i \leftarrow (g_i^{\beta _{f(s_i)}})^a$, for $i=0,\ldots ,m-1$.
$\mathtt {Issue}(W,\mathsf {sk})$: To issue a trapdoor $\mathsf {td}_W$ for a string $W = w_0\ldots w_{i_1}^{(\mathcal {S}_{i_1})}\ldots w_{i_r}^{(\mathcal {S}_{i_r})}\ldots w_{\ell -1}$ of length $\ell \le n$, the algorithm first checks that all the involved subsets have been taken as input by the $\mathtt {Keygen}$ algorithm, i.e. $\mathcal {S}_{i_j} \in \{ \mathcal {S}^{(1)},\ldots ,\mathcal {S}^{(k)}\}$ for $j=1,\ldots ,r$, and returns $\perp $ otherwise. The function h which maps every index $i_j$ to the integer $d\in \{1,\ldots ,k\}$ such that $\mathcal {S}_{i_j}= \mathcal {S}^{(d)}$ is thus well defined. Letting $\mathcal {D}=\{i_1,\ldots , i_r\}$, we modify the original $\mathtt {Issue}$ procedure as described in Algorithm 3.
$\mathtt {Test}(C,\mathsf {td}_W)$: To test whether the string S encrypted by C contains the substring W, the algorithm parses $\mathsf {td}_W$ as $(c,\mathcal {D},\{\mathcal {I}_j\}_{j=0}^{c-1},\{\widetilde{g}^{L[j]}\}_{j=0}^{c-1},\widetilde{g}^V)$ and C as $\{(C_i,C_i^{(1)},C^{(2)}_i)\}_{i=0}^{m-1}$ and checks, for $j=0,\ldots , m - \ell $, if the following equation holds:
$$\prod _{t=0}^{c-1} e((\prod _{i\in \mathcal {I}_t \wedge i\notin \mathcal {D}} C^{(1)}_{j+i})(\prod _{i\in \mathcal {I}_t \wedge i\in \mathcal {D}} C^{(2)}_{j+i}),\widetilde{g}^{L[t]}) = e(C_j,\widetilde{g}^V).$$
It then returns the (potentially empty) set $\mathcal {J}$ of indexes j for which there is a match.

The values $\beta _j$ defined in this protocol can be seen as an encoding of the subset $\mathcal {S}^{(j)}$, in the same way as the scalars $\alpha _s$ encode the characters $s\in \mathcal {S}$. Actually, it is as if we worked with a larger set $\mathcal {S}'$ containing $\mathcal {S}$ but also the “characters” $\mathcal {S}^{(j)}$. The fact that one encrypts using both encodings makes the ciphertext compatible with any kind of trapdoor: when testing at offset j, if the i-th element of W is a plain character $w_i$, we use $C_{j+i}^{(1)}$, whereas we use $C_{j+i}^{(2)}$ for an element of the form $w_i^{(\mathcal {S}_i)}$. Correctness and security follow directly from the original construction.

Regarding efficiency, encrypting for both encodings adds one element of $\mathbb {G}_1$ per character to the ciphertext. Nevertheless, as we explain in the next section, working with a larger set $\mathcal {S}'$ reduces the number of random scalars needed to generate the trapdoors, which leads to a faster $\mathtt {Test}$ procedure.

6 The Complexity of Our Scheme

We describe in this section the timings one can get for different parameters. But first we discuss the different strategies for choosing the set $\mathcal {S}$.

6.1 Generic Complexity

When considering data streams, the most relevant sets are the set of bits (i.e. $\mathcal {S}=\{0,1\}$) and the set of bytes (i.e. $\mathcal {S}=\{0,\ldots ,255\}$). Larger sets (for example the one containing all sequences of r bytes for some $r>1$) would improve the efficiency of the $\mathtt {Test}$ procedure but would harm our ability to detect all patterns. We focus on four specific points: the sizes of (1) the public key, (2) the ciphertext and (3) the trapdoor, along with (4) the number of pairings required to detect the presence of a pattern of size $\ell $.

1. The size of $\mathsf {pk}$. Let n be the maximum number of bytes one can encrypt with the protocol of Sect. 3.3. If $\mathcal {S}=\{0,1\}$, then the public key contains $(1+2)8n$ elements of $\mathbb {G}_1$, which amounts to 768n bytes using Barreto-Naehrig (BN) [6] curves. If we now consider bytestrings (i.e. $\mathcal {S}=\{0,\ldots ,255\}$), then $\mathsf {pk}$ contains $(1+256)n$ elements of $\mathbb {G}_1$, which amounts to 8224n bytes using the same curves.
2. The length of the ciphertext. Each character is encrypted by 2 elements of $\mathbb {G}_1$, which together take 64 bytes. Therefore, encrypting m bytes requires 512m bytes if $\mathcal {S}=\{0,1\}$ and 64m bytes if $\mathcal {S}=\{0,\ldots ,255\}$.
3. The size of $\mathsf {td}_W$. Our issuing algorithm makes this size harder to evaluate. Indeed, the fact that we can reuse the same random scalar for two different characters $w_i\ne w_j$ implies that the size of $\mathsf {td}_W$ strongly depends on the keyword W itself. For example, a “constant” keyword $W=s\ldots s$ of size $\ell $ would entail a trapdoor containing $\ell +1$ elements of $\mathbb {G}_2$. Conversely, a keyword $W=w_0\ldots w_{\ell -1}$ with $w_i\ne w_j$ for $i\ne j$ would only require storing 2 elements of $\mathbb {G}_2$. Nevertheless, we notice that larger sets decrease the probability of having equal characters. More specifically, assuming a uniform distribution of the characters within a keyword, a trapdoor contains, on average, $(1+\lceil \ell /2\rceil )$ elements of $\mathbb {G}_2$ if $\mathcal {S}=\{0,1\}$ and only $(1+\lceil \ell /256\rceil )$ if $\mathcal {S}=\{0,\ldots ,255\}$. We can then hope to gain a factor of 128 in the latter case.
4. The number of pairings. The number of pairings one must compute to test the presence of a keyword W of length $\ell $ within an encrypted string is related to the size of the corresponding trapdoor $\mathsf {td}_W$. More specifically, if $\mathsf {td}_W$ contains N elements of $\mathbb {G}_2$, then one must perform $N(m-\ell +1)$ pairings, where m is the length of the encrypted string. Therefore, a shorter trapdoor implies a more efficient $\mathtt {Test}$ procedure, which means that it is better to work with $\mathcal {S}=\{0,\ldots ,255\}$ than with $\mathcal {S}=\{0,1\}$.
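The four estimates above can be reproduced with a few lines of arithmetic. The sketch below is only a calculator under stated assumptions: the 32-byte size of a $\mathbb {G}_1$ element is inferred from the 768n and 8224n figures, and the trapdoor count assumes that positions carrying the same character never share a random scalar, which matches the two extreme examples of item 3 but is not taken from the issuing algorithm itself. The function names are hypothetical.

```python
from collections import Counter

G1_BYTES = 32  # size of one G1 element, inferred from 768n = 24n * 32 (assumption)


def pk_bytes(n, bitwise):
    """Public-key size for at most n plaintext bytes."""
    elements = (1 + 2) * 8 * n if bitwise else (1 + 256) * n
    return elements * G1_BYTES


def ciphertext_bytes(m, bitwise):
    """Ciphertext size for m plaintext bytes: 2 elements of G1 per character."""
    characters = 8 * m if bitwise else m
    return 2 * G1_BYTES * characters


def trapdoor_elements(word):
    """Number of G2 elements: one per index set plus the element g~^V.
    Assumes the number of index sets equals the highest character multiplicity."""
    return 1 + max(Counter(word).values())


def pairings(word, m):
    """N * (m - l + 1) pairings to scan an encrypted string of m characters."""
    return trapdoor_elements(word) * (m - len(word) + 1)


if __name__ == "__main__":
    print(pk_bytes(1, bitwise=True), pk_bytes(1, bitwise=False))
    print(ciphertext_bytes(1, bitwise=True), ciphertext_bytes(1, bitwise=False))
    print(trapdoor_elements(b"aaaa"), trapdoor_elements(b"abcd"))
    print(pairings(b"abcd", 1500))
```

For a keyword of four distinct bytes and a 1500-byte stream, this gives 2 elements of $\mathbb {G}_2$ and 2 × 1497 = 2994 pairings.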

Public key aside, we note that working on bytes instead of bits significantly decreases complexity. Our timings therefore correspond to the case where $\mathcal {S}=\{0,\ldots ,255\}$.

6.2 Implementation of SEST for DPI

As explained above, evaluating the size of the trapdoors, and therefore the number of pairings, requires making assumptions about the distribution of the keywords. The previous estimates assumed a uniform distribution of the latter, which is unlikely in practice. We therefore evaluate our protocol on the SNORT public rules set [1] to provide a more concrete estimate (see Note 3).

The SNORT rules set contains thousands of rules, which mostly consist of searching for specific patterns in a stream. We parsed all these rules and obtained 6048 different patterns. Figure 3 describes the sizes of the corresponding trapdoors.

These results highlight the advantage of our issuing protocol: even for large patterns, the trapdoors remain short most of the time thanks to the re-use (when possible) of the random scalars. The whole set of trapdoors thus only amounts to 1.35 MB.

Since the number of pairings is related to the size of the trapdoors, one could try to deduce from Fig. 3 the total number of pairings required to test all SNORT patterns. However, we stress that this would only be a rather loose upper bound. First, many of these patterns are part of the same rule, which makes it possible to avoid unnecessary tests: if there is no match for a pattern defined by a rule, then it is pointless to test the other ones within the same rule. Second, many rules include parameters called “depth”, “offset”, “distance” or “within”, which restrict the search to a smaller part of the stream.

The number of pairings for the whole SNORT rules set is thus significantly smaller than the one we could expect from the complexity evaluation we provide in Sect. 6.1. Moreover, we recall that the optimal Ate pairing [34] that we use to instantiate the map e can be split into two parts that are usually called the Miller loop and the final exponentiation. The latter, which represents roughly half of the computational cost of a pairing, can be performed once for all the pairings involved in the same equality test, which further reduces the complexity of the $\mathtt {Test}$ procedure.
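As a rough illustration of this saving, write $T_M$ for the cost of one Miller loop and $T_F$ for the cost of one final exponentiation (both symbols are introduced here for illustration only). For an equality test involving N pairings, the two strategies cost approximately

$$T_{\mathrm {naive}} = N\,(T_M + T_F), \qquad T_{\mathrm {shared}} = N\,T_M + T_F.$$

Under the rough assumption $T_M \approx T_F$ stated above, the ratio is

$$\frac{T_{\mathrm {shared}}}{T_{\mathrm {naive}}} \approx \frac{N+1}{2N},$$

that is, 3/4 for the smallest case $N=2$, approaching 1/2 as N grows.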

We ran an experiment on a stream of 1500 bytes using a computer running Linux 4.13 and equipped with an Intel E5-1620 3.70 GHz processor. Testing all SNORT rules took 28 min. This is obviously too much for online analysis, but we stress that alternatives (e.g. FHE) offering the same features would be even more complex. Moreover, this corresponds to testing thousands of patterns on a single computer: by using parallelization and more powerful hardware, one could hope to dramatically reduce these timings.

Finally, we provide in Fig. 4 the timings of the $\mathtt {Encrypt}$ and the $\mathtt {Test}$ algorithms for larger strings (up to 30 KB). It shows that encryption remains quite efficient even for large strings. The $\mathtt {Test}$ algorithm is obviously slower since it involves pairing computations, but it takes only approximately one second for strings of a few kilobytes.

7 Conclusion

In this work, we introduced the concept of searchable encryption with shiftable trapdoors (SEST). This type of construction provides a practical solution to the generic problem of pattern matching with universal tokens. Notably, we are the first to provide a searchable encryption alternative that allows for arbitrarily-chosen keywords of arbitrary length, which can be applied to any ciphertext encrypted with the generated public key in this system. In particular, since we do not rely on symmetric keys, multiple entities can use the same public key to encrypt. Moreover, our construction is also highly usable for encrypted streams of data (we need no backtracking), and it returns the exact position at which the pattern occurs. Our instantiation of the SEST primitive uses bilinear pairings, and we allow for some regular expressions such as wildcards, or partial keywords in which we know some entries to be within a given interval.

Beyond applications in deep-packet inspection, the fact that our algorithm essentially follows the approach of Rabin-Karp allows us to also use that same algorithm for application scenarios such as searching on structured data, matching subtrees to labelled trees, delegated searches on medical data (compiled from multiple institutions), or 2D searches.
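For readers unfamiliar with it, the plaintext Rabin-Karp approach mentioned here can be sketched as follows. This is the classical rolling-hash algorithm on unencrypted data, not part of our construction; the base and modulus are arbitrary illustrative choices.

```python
def rabin_karp(text, pattern, base=257, mod=(1 << 61) - 1):
    """Return every offset at which `pattern` occurs in `text` (both bytes)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + pattern[i]) % mod
        window_hash = (window_hash * base + text[i]) % mod
    hits = []
    for j in range(n - m + 1):
        # Compare hashes first; confirm on the raw bytes to rule out collisions.
        if window_hash == pattern_hash and text[j:j + m] == pattern:
            hits.append(j)
        if j + m < n:
            # Slide the window: drop text[j], append text[j + m].
            window_hash = ((window_hash - text[j] * high) * base + text[j + m]) % mod
    return hits


if __name__ == "__main__":
    print(rabin_karp(b"abracadabra", b"abra"))
```

The fingerprint of each window is compared with the fingerprint of the pattern at every offset, which is the same sliding comparison that the $\mathtt {Test}$ procedure performs on ciphertexts.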

We propose a main construction, which we adapt to account for wildcards and for interval searches. The former adaptation is relatively simple, since the issued trapdoor just contains zero coefficients for the wildcards. For the interval searches we need to modify our key generation algorithm, providing special elements that we map interval characters to; however, this only works for intervals which are known in advance.

Our scheme provides trapdoors whose size is at most linear in the size of the keywords, and ciphertexts whose size is linear in the size of the plaintext. Although our public keys are large (linear in the maximal plaintext size), we do achieve a complete decorrelation between the plaintext encryption and the trapdoor generation for the keywords. In practice, the complexity of our scheme (in terms of the number of pairings) is almost linear in the size of the plaintext. Our implementation results for the publicly available SNORT rules show that, while the encryption algorithm scales well with the plaintext size, the testing algorithm, which is slower, will benefit from the fact that it is fully parallelizable.

We prove the security of our scheme under an interactive version of the $\mathsf {GDH}$ assumption. Our modification of this assumption is relatively minor, allowing the adversary to choose on which input to play the $\mathsf {GDH}$ instance. We also argue that our construction offers an interesting tradeoff between the secure, but quite cumbersome, systems based on existing cryptographic primitives and the fast, but insecure, current solutions where the gateway decrypts the traffic. Moreover, we hope that the practical applications of this primitive will incite new work on this subject, in particular to construct new schemes which would rely on standard assumptions.

Notes

1. By contrast, in many cases, the patterns themselves may be publicly known.
2. Solutions using tokenization, such as BlindBox, also output the position. Here we compare with standard searchable encryption that usually does not reveal this information.
3. We stress that the only goal of this section is to provide timings on a concrete and non-artificial set of patterns. We chose the DPI use-case for which searching on encrypted streams is particularly relevant. But we obviously do not claim that our solution is practical enough to handle all Internet traffic worldwide.

References

https://www.snort.org/
Abdalla, M., et al.: Searchable encryption revisited: consistency properties, relation to anonymous IBE, and extensions. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 205–222. Springer, Heidelberg (2005). https://doi.org/10.1007/11535218_13
Abdalla, M., Bourse, F., De Caro, A., Pointcheval, D.: Simple functional encryption schemes for inner products. In: Katz, J. (ed.) PKC 2015. LNCS, vol. 9020, pp. 733–751. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46447-2_33
Agrawal, S., Gorbunov, S., Vaikuntanathan, V., Wee, H.: Functional encryption: new perspectives and lower bounds. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013. LNCS, vol. 8043, pp. 500–518. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40084-1_28
Baron, J., El Defrawy, K., Minkovich, K., Ostrovsky, R., Tressler, E.: 5PM: secure pattern matching. In: Visconti, I., De Prisco, R. (eds.) SCN 2012. LNCS, vol. 7485, pp. 222–240. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32928-9_13
Barreto, P.S.L.M., Naehrig, M.: Pairing-friendly elliptic curves of prime order. In: Preneel, B., Tavares, S. (eds.) SAC 2005. LNCS, vol. 3897, pp. 319–331. Springer, Heidelberg (2006). https://doi.org/10.1007/11693383_22
Beuchat, J.-L., González-Díaz, J.E., Mitsunari, S., Okamoto, E., Rodríguez-Henríquez, F., Teruya, T.: High-speed software implementation of the optimal ate pairing over Barreto–Naehrig curves. In: Joye, M., Miyaji, A., Otsuka, A. (eds.) Pairing 2010. LNCS, vol. 6487, pp. 21–39. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17455-1_2
Boneh, D., Boyen, X., Goh, E.-J.: Hierarchical identity based encryption with constant size ciphertext. In: Cramer, R. (ed.) EUROCRYPT 2005. LNCS, vol. 3494, pp. 440–456. Springer, Heidelberg (2005). https://doi.org/10.1007/11426639_26
Boneh, D., Di Crescenzo, G., Ostrovsky, R., Persiano, G.: Public key encryption with keyword search. In: Cachin, C., Camenisch, J.L. (eds.) EUROCRYPT 2004. LNCS, vol. 3027, pp. 506–522. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24676-3_30
Boneh, D., Raghunathan, A., Segev, G.: Function-private identity-based encryption: hiding the function in functional encryption. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013. LNCS, vol. 8043, pp. 461–478. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40084-1_26
Boneh, D., Waters, B.: Conjunctive, subset, and range queries on encrypted data. In: Vadhan, S.P. (ed.) TCC 2007. LNCS, vol. 4392, pp. 535–554. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-70936-7_29
Boyen, X.: The uber-assumption family. In: Galbraith, S.D., Paterson, K.G. (eds.) Pairing 2008. LNCS, vol. 5209, pp. 39–56. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85538-5_3
Brakerski, Z., Segev, G.: Function-private functional encryption in the private-key setting. In: Dodis, Y., Nielsen, J.B. (eds.) TCC 2015. LNCS, vol. 9015, pp. 306–324. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46497-7_12
Canard, S., Diop, A., Kheir, N., Paindavoine, M., Sabt, M.: BlindIDS: market-compliant and privacy-friendly intrusion detection system over encrypted traffic. In: Karri, R., Sinanoglu, O., Sadeghi, A.-R., Yi, X. (eds.) Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, 2–6 April 2017, pp. 561–574. ACM (2017)
Canetti, R., Halevi, S., Katz, J.: A forward-secure public-key encryption scheme. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 255–271. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-39200-9_16
Chase, M., Kamara, S.: Structured encryption and controlled disclosure. In: Abe, M. (ed.) ASIACRYPT 2010. LNCS, vol. 6477, pp. 577–594. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17373-8_33
Chase, M., Shen, E.: Substring-searchable symmetric encryption. PoPETs 2015(2), 263–281 (2015)
Curtmola, R., Garay, J.A., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: improved definitions and efficient constructions. In: Juels, A., Wright, R.N., De Capitani di Vimercati, S. (eds.) ACM CCS 06, pp. 79–88. ACM Press, October/November 2006
Desmoulins, N., Fouque, P.-A., Onete, C., Sanders, O.: Pattern matching on encrypted streams (full version). IACR Cryptology ePrint Archive 2017:148 (2017)
Galbraith, S.D., Paterson, K.G., Smart, N.P.: Pairings for cryptographers. Discrete Appl. Math. 156(16), 3113–3121 (2008)
Gennaro, R., Hazay, C., Sorensen, J.S.: Automata evaluation and text search protocols with simulation-based security. J. Cryptol. 29(2), 243–282 (2016)
Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Mitzenmacher, M. (ed.) 41st ACM STOC, pp. 169–178. ACM Press, May/June 2009
Hazay, C., Lindell, Y.: Efficient protocols for set intersection and pattern matching with security against malicious and covert adversaries. J. Cryptol. 23(3), 422–456 (2010)
Huang, L.-S., Rice, A., Ellingsen, E., Jackson, C.: Analyzing forged SSL certificates in the wild. In: 2014 IEEE Symposium on Security and Privacy, pp. 83–97. IEEE Computer Society Press, May 2014
Jarmoc, J.: SSL/TLS interception proxies and transitive trust. Presentation at Black Hat Europe (2012)
Katz, J., Malka, L.: Secure text processing with applications to private DNA matching. In: Al-Shaer, E., Keromytis, A.D., Shmatikov, V. (eds.) ACM CCS 10, pp. 485–492. ACM Press, October 2010
Katz, J., Sahai, A., Waters, B.: Predicate encryption supporting disjunctions, polynomial equations, and inner products. J. Cryptol. 26(2), 191–224 (2013)
Lauter, K., López-Alt, A., Naehrig, M.: Private computation on encrypted genomic data. In: Aranha, D.F., Menezes, A. (eds.) LATINCRYPT 2014. LNCS, vol. 8895, pp. 3–27. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16295-9_1
Mohassel, P., Niksefat, S., Sadeghian, S., Sadeghiyan, B.: An efficient protocol for oblivious DFA evaluation and applications. In: Dunkelman, O. (ed.) CT-RSA 2012. LNCS, vol. 7178, pp. 398–415. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27954-6_25
Naylor, D., et al.: Multi-context TLS (mcTLS): enabling secure in-network functionality in TLS. In: Proceedings of SIGCOMM 2015, pp. 199–212. ACM (2015)
Sherry, J., Lan, C., Popa, R.A., Ratnasamy, S.: BlindBox: deep packet inspection over encrypted traffic. In: Uhlig, S., Maennel, O., Karp, B., Padhye, J. (eds.) SIGCOMM 2015, pp. 213–226. ACM, August 2015
Song, D.X., Wagner, D., Perrig, A.: Practical techniques for searches on encrypted data. In: 2000 IEEE Symposium on Security and Privacy, pp. 44–55. IEEE Computer Society Press, May 2000
Troncoso-Pastoriza, J.R., Katzenbeisser, S., Celik, M.: Privacy preserving error resilient DNA searching through oblivious automata. In: Ning, P., De Capitani di Vimercati, S., Syverson, P.F. (eds.) ACM CCS 07, pp. 519–528. ACM Press, October 2007
Vercauteren, F.: Optimal pairings. IEEE Trans. Inf. Theory 56(1), 455–461 (2010)

Acknowledgments

Nicolas Desmoulins and Olivier Sanders were supported in part by the French ANR Project ANR-16-CE39-0014 PERSOCLOUD. Pierre-Alain Fouque and Cristina Onete are grateful for the support of the ANR through project 16 CE39 0012 (SafeTLS).

Author information

Authors and Affiliations

Orange Labs, Applied Crypto Group, Caen, France
Nicolas Desmoulins
Université de Rennes 1 & Institut Universitaire de France, Rennes, France
Pierre-Alain Fouque
Université de Limoges, CNRS UMR 7252, Limoges, France
Cristina Onete
Orange Labs, Applied Crypto Group, Cesson-Sévigné, France
Olivier Sanders

Corresponding author

Correspondence to Olivier Sanders.

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore, Singapore
Thomas Peyrin
University of Auckland, Auckland, New Zealand
Steven Galbraith

About this paper

Cite this paper

Desmoulins, N., Fouque, PA., Onete, C., Sanders, O. (2018). Pattern Matching on Encrypted Streams. In: Peyrin, T., Galbraith, S. (eds) Advances in Cryptology – ASIACRYPT 2018. ASIACRYPT 2018. Lecture Notes in Computer Science, vol 11272. Springer, Cham. https://doi.org/10.1007/978-3-030-03326-2_5

DOI: https://doi.org/10.1007/978-3-030-03326-2_5
Published: 27 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03325-5
Online ISBN: 978-3-030-03326-2
eBook Packages: Computer Science, Computer Science (R0)
