Privacy Preserving Record Matching Using Automated Semitrusted Broker
 6 Citations
 882 Downloads
Abstract
In this paper, we present a novel scheme that allows multiple data publishers that continuously generate new data and periodically update existing data, to share sensitive individual records with multiple data subscribers while protecting the privacy of their clients. An example of such sharing is that of health care providers sharing patients’ records with clinical researchers. Traditionally, such sharing is performed by sanitizing personally identifying information from individual records. However, removing identifying information prevents any updates to the source information to be easily propagated to the sanitized records, or sanitized records belonging to the same client to be linked together. We solve this problem by utilizing the services of a third party, which is of very limited capabilities in terms of its abilities to keep a secret, secret, and by encrypting the identification part used to link individual records with different keys. The scheme is based on strong security primitives that do not require shared encryption keys.
Keywords
Record Matching Inverted Index Retrospective Query Privacy Preserving Record Linkage Searchable Encryption Scheme1 Introduction
Many applications exist where a group of data sources (publishers) continuously generate sensitive data, periodically update the same, and share the data with another group of data analyzers (subscribers). To protect the privacy of the clients of the publishers, the data sharing needs to occur in a privacypreserving manner, which in its simplest form is enabled by removing identifying information from the data. An example of such data sharing is observed in the socalled clinical datasharing networks. Different health care providers (e.g., medical clinics, laboratories, hospitals and pharmacies) are the publishers of the data for the networks while clinical researchers are the subscribers of the data. Unlike the traditional privacypreserving data publishing domain, the data in such clinical datasharing networks are not static but are updated every time a patient interacts with a data publisher.
Owing to the updatable nature of the data, a unique and challenging situation occurs in such applications that is not observed in traditional privacy preserved data publishing setups. Any updates to a record on the publisher side must be pushed to the corresponding record on the subscriber side even though these two records have been delinked via sanitization algorithms. Consider the clinical datasharing example. Assume that a clinical researcher needs data related to a specific demography. In this case, patients’ identification information (such as SSN, driver’s license number, date of birth, etc. ) are typically removed when the medical information is shared with the researcher. To provide the most relevant and current data, patients’ progress under treatment regimens would need to be propagated to the clinical researcher. Similarly, the researcher should also be able to pull updates from the publisher or at a minimum be able to query the publisher for updates. To allow such sharing of information, records at the publisher need to be somehow linked back to corresponding sanitized records at the subscriber in a privacy preserving manner.
Things become more challenging if this sharing needs to be carried out between multiple publishers and multiple subscribers. Publishers are often business competitors and unwilling to reveal to each other that they might be sharing clients between themselves. In such cases, two publishers should not know that they have a common group of clients. (Sharing such information under explicit directives from a client is allowed and is not considered here). For privacy reasons, two subscribers should not be able to determine that they have clients in common; they should not be able to link or trace two sanitized records to the same client. When a publisher has more than one record for the same client, the same number of sanitized records should be available at the subscriber and be updated as needed. This occurs, for example, when a patient has repeated visits to the doctor for treatment.
Not much work has been done in this area of privacy preserving record linkage in dynamic setting. Some existing techniques that partially address the problem require encryption of linkage information using a shared key between data publishers to find if matched individuals’ data exist across multiple sites. However, this technique works for small communities; it is expensive to deploy in large heterogeneous setups. In addition, shared keys among a large number of entities increase the chances of key leakage. An alternative technique that is used in the medical community is to utilize the services of a trusted third party, HB, called the Honest Broker. The third party maintains the linking information between the subscriber data and the publisher data in a deidentified manner. The problem with this approach is that the Honest Broker has all information, which makes it a lucrative target for attackers. If the Honest Broker is compromised it will cause a catastrophic damage to both data publishers as well as to individual clients.
In this work, we address this problem of privacy preserving record linkage by proposing a secure scheme based on partially homomorphic encryption and a third party. The third party is responsible just for automatically and blindly perform record matching. It is honest in the sense that it follows the protocol correctly but is not trusted to keep a secret, secret. It is curious about the sensitive information contained in individual records and can act accordingly. However, our protocol ensures that it is prevented from getting such information without colluding with publishers. The third party knows that two publishers or subscribers have clients in common; however, it does not know the identities of these clients.
The main idea behind our protocol is as follows. Each data publisher creates a “key converter” in collaboration with other data publishers. Each key converter is then given to the third party (henceforth referred to simply as the broker). Each data publisher encrypts critical identification information of its data using it own secret key. Later, the broker uses the “key converters” to blindly transform all publisherencrypted identification information to an alternate encrypted form under some other key that is not known to any party including the broker itself. The broker needs to collude with at least one of the publishers to find that key. Once the linkage information is encrypted under the same key, the broker can easily determine matching records. The broker can also keep track of the individuals found at different sites for retrospective update queries purposes. Since the data is encrypted at the source with different keys that the broker does not have access to, the risk of privacy breach in case of the broker getting compromised is limited.
The rest of the paper is organized as follows: In Sect. 2, we walk through previous works related to the underlined problem. Section 3 gives some background information about primitives used to construct the scheme. In Sect. 4, we detail our construction. In Sect. 5, we provide a brief complexity and security analysis. Finally, Sect. 6 concludes the paper.
2 Related Work
Privacy preserving data sharing has been a well studied problem, particularly in the context of sharing information from databases controlled by multiple parties. The participating parties in a privacy preserving database querying system are: the data owner, variously called the data source or publisher of the data, who provides access to its data to others, the data querier who generates query against the publisher’s data and receives the results, and the host who (potentially) stores the publisher’s data, and executes the query by performing relevant computations.
In our setting, we respectively denote by publisher, subscriber and broker the data owner, data querier and the host. The challenging scenario addressed in this work considers many competitor publishers that do not want to reveal any information about their data to each other but would like to anonymously and securely share some information with the subscriber. In addition, the subscriber is not only interested in querying their data separately, but jointly in order to find connected records across the databases. Furthermore, the subscriber wants to be able to retrieve updates about some previously queried entities.
If the querier is the owner of the data itself, and the host is an untrusted outsourced third party, searchable encryption schemes or Oblivious RAM (ORAM) are used to protect the data and maintain the owner privacy. Considerable amount of research has been done towards this problem that resulted in very efficient and expressive schemes [1, 2, 3, 4, 5, 6]. However, in searchable encryption schemes, the data to be joined must be encrypted under the same key that has been shared among the owners. So searchable encryption schemes cannot directly be applied for our scenario.
In some settings, there are at least two parties each hosting their own data and either one or both of them wants to query other’s data against his data. Each of them is considered the adversary to the other and ideally should not learn any thing beyond the query result. For such cases, private set intersection [7, 8] and multiparty computation are potential solutions [9]. For instance, Agrawal et al. [10] propose a set of operations on private databases between two parties where nothing is learned by the parties rather than the results of the operations. De Cristofaro et al. [11] make also use of set intersection techniques, yet using a certification authority (CA) to authorize queries and leverage client and server privacy. However, these solutions although appealing from a security perspective are not very efficient for large settings.
Solutions proposed in [12, 13] introduce a third party (Honest Broker) who will work as a mediator between the querier and the owners. This solution is impractical and not secure. In fact, although these solutions could be easy to deploy, noncryptographic hash usage makes unauthorized users able to compute the hash; if the domain of the information is small or well known then it is possible to find the original information or at least to verify if some value exists or not. If pseudorandom functions are used instead, then a shared secret key must be used, which is undesirable in our setup because of the competition between different publishers. In addition, owing to the significant amount of trust required on the Honest Broker it becomes an appealing target for the attackers and malicious insiders [14, 15].
Yin and Yau [16] propose a privacy preserving repository for data integration across data sharing services that allows owners to specify integration requirements and data sharing services to safeguard their privacy. The owners determine who and how their data can be used. The authors presented what they call context aware data sharing to help data services to share data with the repository. However, the matching process between records is done using the hash value of their identification information, which is not secure and does not preserve privacy. This scheme suffers from the same underlying problems of the previous construction, namely, a mandatory secret sharing between competing parties. Carbunar and Sion [17] introduce a mechanism for executing a general binary join operation on an outsourced relational database with low overhead using predefined binary finite match predicates for computed match set. However they assume that the keys of both columns used in the join are known to the data owner. If the data belongs to two different owners, this key must be shared beforehand – essentially the same problem as earlier.
Chow et al. [18] propose a model for performing privacy preserving operations in a distributed database environment. The model, called TwoParty Query computation, comprises a randomizer and computing engine that do not reveal any information between themselves. The model in essence emulates a central party with the functions split into two entities. In order to guarantee that the randomizer and the computing engine do not collude, the authors proposes to use key agreement protocol among the participating entities. This protocol has limited applicability to our scenario since it doesn’t not support the retrospective queries.
Finally, a solution presented by Tassa et al. [19] targets gathering data between horizontally or vertically divided dataset while preserving sensitive information. The computation is over sanitized data set using kanonymity or ldiversity sanitizing techniques. Their main focus is on anonymization of distributed databases in such a way that each party locally anonymizes its data without revealing its original content to others. While the ideas and techniques of this work are suitable for distributed data mining problems, since our requirements are quite different Tassa et al.ś work cannot be easily used to serve our needs.
3 Background
In the following, we recapitulate some important concepts that we use to build our protocols.
3.1 ElGamal Cryptosystem
ElGamal publickey cryptosystem [20] is defined over finite field \(\mathbb {F}_q\) of a prime order q. The public key \(\mathsf {pk}\) equals (G, q, g, h), where G is a cyclic group of order q with g as a generator, and \(h = g^x\). The private key \(\mathsf {sk}\) equals \(x \xleftarrow {R} \{1,\cdots ,q1\}\).
The encryption of a message m using the public key is \((c_1,c_2)\) and computed such that \((c_1,c_2) = \mathsf {Enc_{pk}}(m) = (g^r,m \cdot h^r)\), where \(r \xleftarrow {R} \{0, \ldots , q1\}\). To decrypt the ciphertext \((c_1,c_2)\) and retrieve \(m=\mathsf {Dec_{sk}}(c_1,c_2)\), the private key is needed as follows: first compute \(s=c_1^{\mathsf {sk}}=(g^r)^x\). The decrypted message m equals \(m= c_2 \cdot s^{1}=m \cdot h^r \cdot g^{rx} = m \cdot (g^x)^r \cdot g^{rx}=m\).
3.2 Elliptic Curve Cryptography and the Discrete Logarithm Problem (ECDLP)
An Elliptic Curve [21] E over a finite field \(\mathbb {F}_q\) of prime order q, is a set of points with coordinates from that field defined by an equation of the form \(y^2+a_1xy+a_3y=x^3+a_2x^2+a_4x+a_6\) for all \(a_i \in \mathbb {F}_q\), or of the simpler form \(y^2=x^3+ax+b\), with \(a,b \in \mathbb {F}_q\) for finite fields of order different from 2 or 3. The coefficients define the shape of the curve and are the parameters of the curve. The set of points together with a special point called the point at infinity \(\mathcal {O}\), form a group under the addition of points operation. Multiplication of points is not defined; however, multiplying a point P by a scalar u is defined as the addition of the point P u number times, i.e. \(uP=\underbrace{P+P+\dots +P}_{u~times}\). The cyclic subgroup \(E(\mathbb {F}_q)\) is defined by its generator (base point) P with order n, which is the smallest positive integer such that \(nP=\mathcal {O}\). This subgroup \(E(\mathbb {F}_q)=\{\mathcal {O},P,2P,\dots ,(n1)P \}\) is denoted by its domain parameters (q, a, b, P, n).
Given an Elliptic Curve \(E(\mathbb {F}_q)\) over a finite field and two points \(P, Q \in E\), it is hard to find an integer \(x \in \mathbb {Z}_q\) such that \(Q=xP\). This is known as the Elliptic Curve Discrete Logarithm Problem (ECDLP). ECDLP is believed to be harder than finding Discrete Logarithms for finite fields, which is why many public key cryptosystems uses Elliptic Curves (EC) as the underlying group. For our purposes, the ElGamal cryptosystem and its variations are defined using EC as follows.
First, the communicating parties agree on the EC parameters and the corresponding field, i.e., \(E(\mathbb {F}_q)\), the generator point G, and its order n. From these parameters each party generates its private key as \(\mathsf {sk}_i=x, x\overset{R}{\leftarrow } \{1,\cdots ,n1\}\), and its public key as the point \(\mathsf {pk}_i=xG\). The messages to be encrypted must be encoded as points on the curve in order to apply the group addition, or the messages must be integer scalars in the range \(\{1,\cdots ,n1\}\) to use multiplications of points by scalars. The encryption of encoded message M is then performed by the sender A using the recipient B’s public key \(\mathsf {sk}_B\) as \(C_1= r \cdot G\), and \(C_2=r\cdot \mathsf {pk}_B +M\), where \(r\overset{R}{\leftarrow }\{1,\cdots ,n1\}\). The decryption at the receiver side B is done using its private key \(\mathsf {sk}_B\) as \(M=C_2\mathsf {sk}_B\cdot C_1=r\cdot \mathsf {pk}_B+Msk_B\cdot r\cdot G= r\cdot k\cdot Gk\cdot r\cdot G+M = M\).
3.3 Homomorphic Encryption
Homomorphic encryption allows arithmetic operations to be carried out on ciphertext in such a way that the decrypted result matches the result of the operations when performed on the plaintext. Partially Homomorphic Encryption system (PHE) allows either addition or multiplication but not both. In this work, we focus only on the multiplicative property.
4 Scheme Construction
Our scheme works in three phases: the setup phase, the encryption of query results phase, and the secure record matching phase. The setup phase is executed only once by the data publishers to create the socalled publishers’ “key converters”. It utilizes properties of homomorphic encryption to create these “key converters”. The encryption of query results phase occurs at the publisher side whenever it executes a query, to encrypt the identifying part of the query results. This encryption is performed using the publisher’s secret nonshared key before sending the results to the broker. The secure record matching phase occurs at the broker side to determine the linkages between the records coming from the publishers after executing the queries.
We begin by discussing the adversarial model and privacy requirements, then we explain each phase in details.
4.1 Adversary Model and Privacy Requirements
Data publishers are considered competitors who do not want to share data with each other. Each publisher tries to determine information about the clients of competitor publisher, or at a minimum, tries to determine if any one of its clients is also a client of its competitor. However, publishers are also honest in the execution of the protocol steps and willing to share information with subscribers privately, that is, without revealing real identities, and securely, that is, without any leakage of information to other data publishers, and without revealing the publisher identity to the subscribers.
A data subscriber, on the other hand, needs to determine if any information that came from different publishers belong to the same individual so they could be grouped together as such and treated accordingly. For example, if a researcher is looking for the side effects of a new drug used for skin treatment on patients who has kidneys problems, then he has to match patients from the (Dermatology) and (kidney diseases) departments to find patients under these conditions. We need to allow such grouping at the subscriber side.
Further more, the subscriber is allowed to issue retrospective queries regarding some individual client, for example, update queries regarding the progress of treatment of certain patients. Subscribers (researchers) are considered curious in the sense they will try to determine the real identities of the individuals. Some information about individual identities might be leaked from their nonidentification information (i.e. eye color, age, weight, etc.) using statistical inference techniques. This is a separate problem that needs to be addressed with anonymization (i.e. kanonymity) or other sanitization methods, and is not considered in this work.

Hide the source of information (publishers and clients’ identities) from the subscribers.

Blindly determine record linkages among the encrypted publishers’ records and assign alternate random identifiers to the linked records before sharing them with the subscribers. The broker will just know the linkages without knowing the real identifiers.
4.2 Setup Phase
Each publisher from the set of N publishers \(D=\{d_i\}_{i \in [N]}\) has its secret key \(\mathsf {sk_i}\), and the broker which has a public key \(\mathsf {pk}\) and private key \(\mathsf {sk}\) pair. The setup phase is illustrated in Fig. 1 and works as follows.

Step 1: The broker broadcasts its public key \(\mathsf {pk}=(G,q,g,g^{\mathsf {sk}})\) for ElGamal homomorphic encryption to all publishers.

Step 2: Each publisher \(d_i\) generates an initial random secret \(r_i \xleftarrow {R} \{1,\cdots ,q1\}\), where q is the order of G, encrypts it using the master key \(\mathsf {pk}\). Let \(t_{i\rightarrow j}\) denote the temporary encrypted key converter of publisher i when being processed by publisher j, and \(t_{i\rightarrow final}\) the final converter when it gets back to the publisher i after being processed by all parties. Publisher \(d_i\) generates the initial temporary encrypted key converter \(t_{i\rightarrow i}=\mathsf {Enc_{pk}}(r_i^{1})\), then forwards it to the next publisher \(d_{(i \mod (N))+1}\).
 Step 3: Each publisher \(d_j\) receives a value \(t_{i\rightarrow j}\) from its upstream neighbor \(d_i\) (\(i \ne j\)), securely multiplies it using ElGamal homomorphic cryptosystem with its secret key \(\mathsf {sk_j}\) encrypted under the broker’s public key \(\mathsf {pk}\) as follows:This operation is repeated through \(N1\) publishers. The temporary value of publisher \(d_i\) generated by the \(j^\mathrm {th}\) publisher equals: \(t_{i\rightarrow j}= \mathsf {Enc_{pk}}(r_i^{1} \cdot \prod _{c=i+1}^{j} \mathsf {sk_{c}})\)$$\begin{aligned} t_{i\rightarrow i+1}&=t_{i\rightarrow i} \cdot \mathsf {Enc_{pk}(sk_{i+1}})\\&= \mathsf {Enc_{pk}}(r_i^{1} \cdot \mathsf {sk_{i+1}}) \end{aligned}$$
 Step 4: After \(N1\) transitions, the initials publisher receives the final key converter \(t_{i\rightarrow final}\) as$$ t_{i\rightarrow final} = \mathsf {Enc_{pk}}\left( r_i^{1} ~\prod _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} \mathsf {sk_j} \right) $$

Step 5: After each publisher is able to generate its key converter value after being processed by all other publishers, the publisher \(d_i\) sends its \(t_{i\rightarrow final}\) value to the broker, and saves a copy of it for future updates in case new publishers joined the system.

Step 6: For each \(t_{i\rightarrow final}\), the broker extracts the key conversion factor \(\delta _i\) such that: \( \delta _i = \mathsf {Dec_{sk}}(t_{i\rightarrow final}) = r_i^{1} ~\prod _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^{N} sk_j \)
Note: Careful readers might think that it would be simpler if publishers broadcast their keys and then locally compute the key converters. It is true that in term of communication overhead, this method also involves \(O(n^2)\) interactions, however, in term of information shared, it leaks more. In this solution, each publisher submits its key encrypted with the public key of the broker. If the broker can somehow eavesdrops the communication between the publishers, it can decrypt and obtains the key of all publishers.
4.3 Encryption of Query Results Phase
This phase is triggered by a query sent by a subscriber requesting information from publishers. This represents a data pull model; however, our scheme can be also used in a data push mode where publishers send data directly to the broker which redirects it to the corresponding subscribers. After executing the received query, each publisher encrypts the identifying parts of the query results using any cryptosystem that relies on DDH (Decisional DiffieHellman ) or DL (Discrete Logarithm) hardness assumptions, such as ElGamal cryptosystem. For performance reasons, our construction uses Elliptic Curves (EC) instead of Finite Groups of large primes as the underlying group for the used cryptosystem.
Each publisher has the publicly known ElGamal EC parameters, i.e., the curve parameters \( E (\mathbb {F}_q)\) and the point on the curve P of prime order n. The public/private key pair will be (\(r_i \cdot \mathsf {sk}_i\cdot P,r_i \cdot \mathsf {sk}_i\)) and both of th keys are kept secret. The message to be encrypted, id, in our multiplicative scheme needs to be a scalar. We denote by \(\mathsf {E(.)}\) the encryption of ElGamal based on EC.
Publishers can avoid having the broker store the key converter \((\delta _i)_{i \in [N]}\). For this purpose, each publisher encrypts the identifiers of the query results with a new random value \(r_i\), updates the key converter \(t_{i\rightarrow final}\), then sends these results to the broker accompanied with the new key converter. This solution adds negligible communication overhead, but ensures a zerokey stored on the broker side.
4.4 Secure Record Matching Phase
The broker receives the encrypted identifiers with different keys from different publishers. The broker’s job is to merge similar clients’ records from different publishers such that they will map to the same newly generated identifier. The broker will use the key converters \(\delta _i\) to change the encryption key in such a way that similar data will be deterministically encrypted with the same key without requiring any decryption to be performed along the way.
Conceptual representation of secure inverted indexes
4.5 Matching Phase WalkThrough
We now summarize all the interactions between the three parties, namely, publishers, subscribers and the broker to describe how privacy preserving record matching system works to serve the subscriber’s needs. These interactions schematically shown in Fig. 2, with each of following steps corresponding to the numbers shown in the figure.
 1.
The subscriber sends a query Q to the broker.
 2.
The broker sanitizes the query if necessary, and checks if this query Q is a query that seeks information about new clients or a retrospective query (requesting more information about an individual whose data has been seen before). If it is a new query, it forwards the query to publishers and wait for the answers.
 3.
If the query Q is a retrospective query, the broker first looks up the \(rid\Vert sid\) in the inverted index for the corresponding encrypted identifiers and their associated publisher identities, i.e. \(\mathsf {E_{r_i\cdot sk_i}}(H(id))  d_i\). The broker then replaces \(rid\Vert sid\) with the corresponding \(\mathsf {E_{r_i\cdot sk_i}}(H(id))\) and finally forwards the query Q only to their associated publishers \(d_i\).
 4.
For each query Q it receives, the publisher \(d_i\) checks if it is a new query or a retrospective query. If it is a retrospective query, the publisher can directly look up the matching records while for a new query a search has to be done to find the corresponding clients. In either case, the publisher applies its local sanitizing policies to the results, encrypts the identification part using its secret key (\(r_i \cdot \mathsf {sk}_i\)), and finally sends the results to the broker.
 5.Upon receiving the results of the forwarded query from the publishers, the broker further sanitizes the results according to the general sharing policies and regulations. Then it performs the secure record matching process, and updates its inverted indexes as follows:

Using the first inverted index, and for each distinct \(\mathsf {E_{r_i \cdot sk_i}}(H(id))\) in the query result, apply the key converters to get \(\mathsf {E}_{\Delta }(H(id))\). If a match is found then this identifier has been sent before (it might be from a different publisher though). So the broker retrieves the corresponding \(\{rid\Vert sid\}\), and updates the second inverted index with the encrypted id \(\mathsf {E_{r_i \cdot sk_i}}(H(id))d_i\) in case it has been previously sent from a different publisher.

If the converted encrypted \(\mathsf {E}_{\Delta }(H(id))\) is not found, then it means that this identifier has not been sent before by any publisher. The broker adds this converted value to the first inverted index, with a new subscriber and random identifier \(rid\Vert sid\). The second inverted index is updated accordingly.

The encrypted identifier \(\mathsf {E_{r_i \cdot sk_i}}(H(id))\) in each record in the results is then replaced with its corresponding \(rid\Vert sid\) before being sent to the corresponding subscribers.
As we have mentioned earlier, these inverted indexes are encrypted under the broker’s secret keys, and all the searching and update operations are performed on encrypted data using any adaptive secure searchable encryption scheme data structures.

5 Complexity and Security Analysis
Complexity: Our scheme is practicable and can be efficiently implemented in real scenarios. The construction is composed of three main steps, the setup, the encryption of query results and the secure record matching phase. The setup phase depends on the number of publishers N. The entire setup phase results in a total communication overhead which is \(O(N^2)\). To generate one key converter, the publishers needs to transfer one message each, while the computation is constant per each transfer. The total communication and computation overhead per publisher is in O(N). The setup phase is performed just once and is done in an offline manner. The encryption of query results overhead depends on the matching set and the data structure that the publisher is using for its own data storage. The search complexity depends on the query sent by the subscriber through the broker. Also the complexity of this phase greatly depends on the size of the database of the publishers and the sanitizing protocols used as well. For this reason, we consider only the communication overhead for this phase. For the last phase performed by the broker, given all identifiers that match the query, the goal consists of retrieving all subscribers identifiers as well as the randomized identifiers. Using SSE, the search is optimal and similar to plaintext data. Given an identifier, the search complexity will be in the number of matching results.
To sum up, the construction does not induce any overhead during the matching phase more than the one expected on plaintext data. The linear construction of the key converters is done once during the instantiation of the protocol.
Security: The key converter is based on ElGamal homomorphic encryption which is semantically secure. However, we want to make sure that the output of the key converter received by the broker will not leak any information about the secret keys of any publisher. The broker receives N key converters for N different publishers. Every key converter is composed of \(N1\) secret key and a random value. The broker then has, at the end of the setup phase, a linear system of N equations with 2N unknowns. This is an underdetermined system which is information theoretically secure. This property ensures that the broker cannot, based on the key converters, recover the secret keys of the publishers. Note that this is true as long as the publishers do not collude with the broker. We can enhance this key converter generation with much secure schemes such as multiparty computation where we can take into consideration malicious parties.
During the encryption of query results phase, every publisher is encrypting the records with a different secret key. Thus, even if a publisher eavesdrops over the communication between publishers and broker, it cannot infer anything about the clients identity.
For the storage of the encrypted identifiers at the broker side, a conversion of the encryption is necessary. The broker therefore knows that this encrypted identifier (whatever be the identifier) is linked to the same client in different publishers. This feature is very important for the scheme correctness since it enables the broker to aggregate the exact subscriber query results. From a security perspective, we want to hide not only the association between encrypted identifiers and subscribers but also the mapping between the publisher identifiers and the encrypted identifiers. For this purpose, we use a symmetric searchable encryption inverted index that enables to store securely these mappings. Consequently, even if the data at the broker is somehow leaked, the entire inverted indexes are encrypted and the mapping will not be disclosed.
6 Conclusions and Future Work
We have presented a novel scheme to allow multiple data publishers, who encrypt their data with their private nonshared key, to share their clients’ records with subscribers while protecting the identities of their clients. The scheme is based on partially homomorphic encryption to create “key converters” that allow a third party with limited trust (broker), to perform privacy preserving record matching. The role of the broker is to determine the linkages without identifying the real identities of the publishers’ clients but also to hide from the subscriber which publisher generated the information. Our construction is achieved by allowing each publisher to use his own secret key to guarantee the security of its data, and the broker, with the help of the key converters, to convert encrypted data of each publisher under different keys to data encrypted under the same key. Both the key converter and data conversion are implemented using ElGamal cryptosystem. Future work aims to provide an experimental evaluation with realworld clinical data and a practical implementation of this system for the clinical datasharing network called CHORDS (Colorado Health Outcome Research Data Services).
Notes
Acknowledgments
This work was partially supported by the U.S. National Science Foundation under Grant No. 0905232, and by Colorado State University under an internal research grant.
References
 1.Chase, M., Kamara, S.: Structured encryption and controlled disclosure. In: Abe, M. (ed.) ASIACRYPT 2010. LNCS, vol. 6477, pp. 577–594. Springer, Heidelberg (2010) CrossRefGoogle Scholar
 2.Curtmola, R., Garay, J.A., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: improved definitions and efficient constructions. In: Proceedings of the 13th ACM Conference on Computer and Communications Security, Alexandria, VA, USA. pp. 79–88 (2006)Google Scholar
 3.Moataz, T., Shikfa, A.: Boolean symmetric searchable encryption. In: Proceedings of the 8th ACM Symposium on Information, Computer and Communications Security, Hangzhou, China. pp. 265–276 (2013)Google Scholar
 4.Cash, D., Jarecki, S., Jutla, C., Krawczyk, H., Roşu, M.C., Steiner, M.: Highlyscalable searchable symmetric encryption with support for boolean queries. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013, Part I. LNCS, vol. 8042, pp. 353–373. Springer, Heidelberg (2013) CrossRefGoogle Scholar
 5.Stefanov, E., van Dijk, M., Shi, E., Fletcher, C.W., Ren, L., Yu, X., Devadas, S.: Path ORAM: an extremely simple oblivious RAM protocol. In: Proceedings of ACM Conference on Computer and Communications Security, Berlin, Germany. 299–310 (2013)Google Scholar
 6.Strizhov, M., Ray, I.: Multikeyword similarity search over encrypted cloud data. In: CuppensBoulahia, N., Cuppens, F., Jajodia, S., Abou El Kalam, A., Sans, T. (eds.) SEC 2014. IFIP AICT, vol. 428, pp. 52–65. Springer, Heidelberg (2014) CrossRefGoogle Scholar
 7.DachmanSoled, D., Malkin, T., Raykova, M., Yung, M.: Efficient robust private set intersection. Int. J. Appl. Crypt. 2, 289–303 (2012)zbMATHMathSciNetCrossRefGoogle Scholar
 8.Kamara, S., Mohassel, P., Raykova, M., Sadeghian, S.: Scaling private set intersection to billionelement sets. In: Christin, N., SafaviNaini, R. (eds.) FC 2014. LNCS, vol. 8437, pp. 193–213. Springer, Heidelberg (2014) Google Scholar
 9.Goldreich, O.: Secure multiparty computation. Manuscript. Preliminary version (1998). http://citeseerx.ist.psu.edu. Accessed on 30 April 2015
 10.Agrawal, R., Evfimievski, A., Srikant, R.: Information sharing across private databases. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA. pp. 86–97 (2003)Google Scholar
 11.De Cristofaro, E., Lu, Y., Tsudik, G.: Efficient techniques for privacypreserving sharing of sensitive information. In: McCune, J.M., Balacheff, B., Perrig, A., Sadeghi, A.R., Sasse, A., Beres, Y. (eds.) Trust 2011. LNCS, vol. 6740, pp. 239–253. Springer, Heidelberg (2011) CrossRefGoogle Scholar
 12.Boyd, A.D., Saxman, P.R., Hunscher, D.A., Smith, K.A., Morris, T.D., Kaston, M., Bayoff, F., Rogers, B., Hayes, P., Rajeev, N., KlineRogers, E., Eagle, K., Clauw, D., Greden, J.F., Green, L.A., Athey, B.D.: The University of Michigan honest broker: a webbased service for clinical and translational research and practice. J. Am. Med. Inform. Assoc. : JAMIA 16, 784–791 (2009)CrossRefGoogle Scholar
 13.Dhir, R., Patel, A.A., Winters, S., Bisceglia, M., Swanson, D., Aamodt, R., Becich, M.J.: A multidisciplinary approach to honest broker services for tissue banks and clinical data. Cancer 113, 1705–1715 (2008)CrossRefGoogle Scholar
 14.Jefferies, N., Mitchell, C.J., Walker, M.: A proposed architecture for trusted third party services. In: Proceedings of the International Conference on Cryptography: Policy and Algorithms, Brisbane, Queensland, Australia. pp. 98–104 (1995)Google Scholar
 15.Ajmani, S., Morris, R., Liskov, B.: A trusted thirdparty computation service (2001). http://www.pmg.lcs.mit.edu/~ajmani/papers/tep.ps. Accessed on 30 April 2015
 16.Yau, S., Yin, Y.: A privacy preserving repository for data integration across data sharing services. IEEE Trans. Serv. Comput. 1, 130–140 (2008)CrossRefGoogle Scholar
 17.Carbunar, B., Sion, R.: Toward private joins on outsourced data. IEEE Trans. Knowl. Data Eng. 24, 1699–1710 (2012)CrossRefGoogle Scholar
 18.Chow, S.S., Lee, J.H., Subramanian, L.: Twoparty computation model for privacypreserving queries over distributed databases. In: Proceedings of the 2009 Network and Distributed System Security Symposium, San Diego, CA, USA (2009)Google Scholar
 19.Tassa, T., Gudes, E.: Secure distributed computation of anonymized views of shared databases. ACM Trans. Database Syst. (TODS) 37, 11 (2012)CrossRefGoogle Scholar
 20.El Gamal, T.: A public key cryptosystem and a signature scheme based on discrete logarithms. In: Blakely, G.R., Chaum, D. (eds.) CRYPTO 1984. LNCS, vol. 196, pp. 10–18. Springer, Heidelberg (1985) CrossRefGoogle Scholar
 21.Koblitz, N.: Elliptic curve cryptosystems. Math. Comput. 48, 203–209 (1987)zbMATHMathSciNetCrossRefGoogle Scholar