1 Introduction

In the era of data explosion, individuals and enterprises need to store huge amounts of data. For example, IDC predicts that the global datasphere will reach 175 zettabytes by 2025 [1]. Faced with this heavy storage burden, users would like to upload their data to the cloud server to relieve themselves of storing and managing such large-scale data. Nonetheless, this results in a large amount of duplicate data in the cloud server, because many users may upload identical data. Studies show that about 75% of data is duplicate in standard application systems [2], and the proportion of identical data even reaches 90% in backup and archival storage systems [3]. Deduplication technology was developed to reduce this redundancy: the cloud server stores only one copy of several identical data items. This technology has attracted wide attention since it saves considerable financial cost for both the cloud server and the user.

Data deduplication has a wide range of applications in various fields, such as cloud-assisted electronic health systems. Compared to traditional medical record management systems, cloud-assisted electronic health systems manage electronic medical records more efficiently, accurately and reliably [4, 5]. In addition, they play an important role in resolving judgments and disputes in medical malpractice cases [6, 7]. The range of diagnostic information such as symptoms and drugs in electronic medical records is limited; for example, there are only about 100 kinds of antibiotics in existence [8]. Accordingly, electronic medical records contain a great deal of identical data. A study shows that data deduplication saves about 65% of storage space in electronic health systems [9]. Data sharing of electronic medical records is very important for medical researchers studying certain diseases. For example, a COVID-19 electronic medical record contains common symptoms of the patient, such as fever and dizziness, and sharing such records helps researchers study how to identify potential COVID-19 patients. Nevertheless, sharing electronic medical records may expose private information. Generally, an electronic medical record consists of two parts. The first part contains sensitive information, such as the patient's name, age, ID number and phone number. The second part is the diagnostic information prescribed by the doctor, including the patient's symptoms, the type of illness, the dose of medication and so on. Only the diagnostic information is valuable to researchers; researchers should not learn the patient's sensitive information through data sharing services [10, 11]. Therefore, it is very important to achieve data sharing in which deduplication is performed efficiently and the sensitive information of shared electronic medical records remains hidden.

1.1 Contribution

To cope with this problem, we design a secure data sharing scheme for cloud-assisted electronic medical systems in this paper. To protect privacy and enhance deduplication efficiency, we replace the patient's sensitive information in each electronic medical record with wildcards before encrypting the whole record. The encrypted electronic medical records are then uploaded to the cloud server, which therefore learns nothing useful about them. Any authorized researcher can decrypt and obtain the electronic medical records, with the sensitive information of shared records remaining hidden. Since the sensitive information is uniformly blinded with wildcards, the duplicate ratio of blinded sensitive information becomes higher; as a result, our scheme remarkably improves deduplication efficiency. Moreover, we classify the diagnostic information of electronic medical records into three types according to duplicate ratio: high, intermediate and low. Authorized researchers can selectively download data according to the duplicate ratio of the diagnostic information. Furthermore, the security of the key might be a bottleneck of the system: if the key server is compromised, the key may be leaked. To improve the security of the key, we employ a proactive secret sharing technique [12] to resist brute-force attacks and single-point-of-failure attack. Security analysis shows that the proposed scheme ensures the confidentiality of shared electronic medical records, that the sensitive information in the records remains confidential even to researchers, and that the integrity of the records is guaranteed. We also conduct experiments to evaluate the performance of the proposed scheme in terms of deduplication efficiency, storage efficiency, computational costs and computation delay.

1.1.1 Organization

The rest of the paper is organized as follows. In Sect. 2, we introduce the related work. In Sect. 3, we present some necessary preliminaries. Section 4 gives the system model and the security model. We describe our secure data sharing scheme in Sect. 5 and give the security analysis in Sect. 6. Performance evaluation is shown in Sect. 7. In the last section, we conclude this paper.

2 Related work

With increasing security awareness, more and more users would like to encrypt their data before uploading it to the cloud server. Unfortunately, even for identical data, different users produce different ciphertexts because they encrypt with different keys, which leaves the cloud server unable to perform deduplication over the ciphertexts. To enable deduplication over encrypted data and improve storage efficiency [13], convergent encryption (CE) was proposed [14]; it derives the encryption key from the data itself. Bellare et al. [15] formalized this idea as a new primitive named message-locked encryption (MLE) and proved its security. Many data deduplication schemes based on convergent encryption and message-locked encryption have since been proposed [16,17,18,19]. Unfortunately, these MLE-based and CE-based schemes cannot resist brute-force attacks [20]. Bellare et al. [21] proposed a server-aided deduplication scheme called DupLESS, which introduces a fully trusted key server to produce MLE keys. After that, many deduplication schemes were proposed to resist brute-force attacks [22,23,24,25], in which the key server is fully trusted and assumed never to be attacked. The completely trusted key server thereby becomes a single point of failure: in reality, it is hard to deploy such a fully trusted key server, and if it is broken into, the secret it stores is exposed. In other words, attackers need only compromise one key server to obtain the server-side secret. To resist single-point-of-failure attack, Zhang et al. [26] and Duan et al. [27] proposed to store the secret across multiple key servers using threshold secret sharing [28]. Zhang et al. [12] improved the multi-key-server scheme by replacing some key servers periodically: the new key servers hold new secret shares while still sharing the same server-side secret, which improves the security of the server-side secret. As mentioned in [15], there exists a duplicate faking attack that prevents legitimate users from obtaining correct data. More specifically, if an attacker uploads the tag of \(m\) but the ciphertext of a different message \({m}^{*}\), the tag becomes inconsistent with the stored ciphertext \({m}^{*}\); subsequent users who upload the tag of \(m\) will then download the ciphertext of the wrong data \({m}^{*}\). To solve this problem, tag consistency is enforced to ensure data integrity [18, 29, 30]: users verify tag consistency after decrypting the returned ciphertext. If the verification fails, however, users cannot determine the cause: the uploaded data may have been fake, or the cloud server may have corrupted the data. Li et al. [31] and Yang et al. [32] proposed generating the tag directly from the ciphertext, so that the cloud server can check tag consistency during data upload. As mentioned above, convergent encryption cannot resist brute-force attacks on predictable messages. Hence, data popularity has been used to preserve the privacy of the data [16]: a data block is considered "popular" when it is owned by more than a threshold number of users, and "unpopular" otherwise.
Data blocks with different popularity are protected under encryption mechanisms with different security levels. Puzio et al. [33] proposed the PerfectDedup scheme, which exploits the characteristics of perfect hashing to ensure data confidentiality. In addition, Li et al. [34] introduced the concept of transparent integrity auditing, which keeps the cloud server from misbehaving. To audit the integrity of users' data without downloading them, Shao et al. [35] proposed an efficient TPA-based auditing scheme for secure cloud storage.

As we know, an electronic medical record is composed of the patient's sensitive information and diagnostic information. In general, the duplicate ratio of the sensitive information is low, while that of the diagnostic information is high. Based on this feature, Zhang et al. [26] designed a deduplication scheme that performs deduplication only on the diagnostic information. Nonetheless, all the sensitive information is stored in the cloud server, which increases its storage costs; furthermore, this scheme does not support data sharing. Shen et al. [36] designed a data sharing scheme with sensitive information hiding. Although the sensitive information is hidden from researchers, it is stored repeatedly; because this scheme does not support deduplication, its storage efficiency is low.

As far as we know, none of the above schemes achieves secure data sharing that supports both efficient deduplication and sensitive information hiding. In this paper, we explore how to achieve secure data sharing with sensitive information hiding while also improving deduplication efficiency.

3 Preliminaries

In this section, we introduce the basic knowledge needed in this paper, including MLE, bilinear map, discrete logarithm problem and computational Diffie-Hellman problem.

3.1 Message-locked encryption

Message-locked encryption (MLE) is a special kind of symmetric encryption that is widely adopted in deduplication. The encryption and decryption keys are computed from the messages themselves, so the same message corresponds to the same key and the same ciphertext no matter who runs the encryption. MLE can be expressed as a tuple MLE = (P, K, E, D) of four algorithms:

  • \(P\leftarrow_{\$}\mathrm{P}\): the parameter generation algorithm. It outputs a public parameter P which is published to all users.

  • \(K\leftarrow \mathrm{K}\left(P,M\right)\): the key generation algorithm. It takes the public parameter P and the message M as input and outputs the MLE key K. The same message M always produces the same MLE key K.

  • \(C\leftarrow \mathrm{E}(P,K,M)\): the encryption algorithm. It takes the public parameter P, the MLE key K and the message M as input and outputs the ciphertext C.

  • \(M\leftarrow \mathrm{D}(P,K,C)\): the decryption algorithm. It takes the public parameter P, the MLE key K and the ciphertext C as input and outputs the plaintext M.

No attacker can distinguish a ciphertext generated by MLE from a random string, for unpredictable messages, except with negligible probability [15]. A special instantiation of MLE is convergent encryption (CE), in which the key is the hash value of the message.
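To make the four algorithms concrete, the following toy sketch instantiates CE in Python using only the standard library. The SHA-256-based keystream is an illustrative stand-in rather than a production cipher, and the block content is invented for the example.

```python
import hashlib

def ce_key(message: bytes) -> bytes:
    """CE instantiates the MLE key K as the hash of the message itself."""
    return hashlib.sha256(message).digest()

def _keystream(key: bytes, length: int) -> bytes:
    # Toy deterministic keystream: SHA-256(key || counter). Determinism is
    # what makes ciphertexts of identical messages collide.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def ce_encrypt(key: bytes, message: bytes) -> bytes:
    ks = _keystream(key, len(message))
    return bytes(m ^ k for m, k in zip(message, ks))

def ce_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    # The XOR stream cipher is its own inverse.
    return ce_encrypt(key, ciphertext)

def ce_tag(ciphertext: bytes) -> bytes:
    """Deduplication tag derived from the ciphertext (cf. Sect. 5.2.5)."""
    return hashlib.sha256(ciphertext).digest()

# Two independent users encrypting the same block obtain the same key,
# ciphertext and tag, so the server can deduplicate without the plaintext.
m = b"fever, dry cough, dose: 500 mg"
assert ce_encrypt(ce_key(m), m) == ce_encrypt(ce_key(m), m)
assert ce_decrypt(ce_key(m), ce_encrypt(ce_key(m), m)) == m
```

Because encryption is deterministic in the message, equal blocks collide under both the ciphertext and the tag, which is exactly the property deduplication needs.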

3.2 Bilinear map

Let G and \({G}_{T}\) be an additive cyclic group and a multiplicative cyclic group of large prime order \(p\), respectively. Let P be a generator of G. A bilinear map is a map \(e:G\times G\to {G}_{T}\) with the following properties:

  (1) Bilinearity: for all \(a,b\in {Z}_{p}^{*}\) and \(A,B\in G\), we have \(e\left(aA,bB\right)={e(A,B)}^{ab}\).

  (2) Nondegeneracy: \(e\left(P,P\right)\ne 1\).

  (3) Computability: for all \(A,B\in G\), \(e\left(A,B\right)\) can be computed efficiently.
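A routine consequence of bilinearity, recorded here because it underpins the validity checks in Sects. 5.2.3 and 5.2.4, is that scalars can be moved freely across the pairing:

\(e\left(xA,B\right)={e(A,B)}^{x}=e\left(A,xB\right)\) for all \(x\in {Z}_{p}^{*}\) and \(A,B\in G\).

In particular, a response of the form \(\sigma ={s}_{k}\cdot {m}^{\prime}\) can be checked against the public commitment \({Q}_{k}={s}_{k}\cdot P\) via \(e\left(\sigma ,P\right)=e\left({m}^{\prime},{Q}_{k}\right)\) without revealing \({s}_{k}\).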

3.3 Discrete logarithm (DL) problem

Let G be an additive cyclic group of large prime order \(p\) and P be its generator. For an unknown \(x\in {Z}_{p}^{*}\), given \(xP\) as input, the goal is to output \(x\). The DL assumption holds if no probabilistic polynomial-time algorithm can output \(x\) with non-negligible probability.

3.4 Computational Diffie–Hellman (CDH) problem

Let G be an additive cyclic group of large prime order \(p\) and P be its generator. For unknown \(x,y\in {Z}_{p}^{*}\), given \((P,xP,yP)\) as input, the goal is to output \(xyP\). The CDH assumption holds if no probabilistic polynomial-time algorithm can output \(xyP\) with non-negligible probability.

4 System model and security model

4.1 System model

The system model involves four entities: hospital, cloud server, key servers and researchers, as shown in Fig. 1.

  • Hospital: The hospital is a fully trusted entity. It produces huge amounts of electronic medical records every day, which need to be moved to the cloud server and shared with researchers. To ensure the security of the electronic medical records, the hospital interacts with the key servers to produce the MLE keys used to encrypt them, and it will not collude with any attacker. If a researcher wants to obtain electronic medical records, he/she first needs to obtain authorization from the hospital.

  • Cloud server: The cloud server has huge storage space and powerful computing ability. It verifies identities, performs deduplication, and calculates and updates the statistical information of electronic medical records in real time. Through the cloud server, the hospital shares the electronic medical records with researchers while the privacy of patients is protected.

  • Key servers: Key servers are independent of the hospital, researchers and the cloud server. They use \((t,n)\)-threshold secret sharing to share a constant server-side secret, which is used to compute the MLE key. Each key server stores a share of the server-side secret. The hospital needs to generate the MLE key by interacting with the \(n\) key servers. These \(n\) key servers are not fully trusted, which means that one or more of them may be compromised.

  • Researchers: Researchers need to obtain authorization from the hospital and then download electronic medical records from the cloud server for their research.

Fig. 1 System model

4.2 Adversary model

In the adversary model, we mainly consider two kinds of adversaries: internal adversaries and external adversaries.

4.2.1 Internal adversaries

  • The cloud server: Similar to the existing literature for cloud computing security [29, 37, 38], we assume that the cloud server will perform its task honestly for the sake of its reputation, but it is curious about the electronic medical records uploaded by the hospital and tries to obtain the plaintext of the electronic medical records.

  • Researchers: We assume that authorized researchers will not collude with the hospital, the cloud server, unauthorized researchers and adversaries. Authorized researchers can download electronic medical records from the cloud server and obtain disease-related information in electronic medical records. Meanwhile, researchers are curious about sensitive information of patients, such as the patient's name and phone number.

  • The compromised key servers: Adversaries may extract the secret shares from the compromised key servers and then launch brute-force attacks to guess the plaintext of the electronic medical records. Once the server-side secret is exposed, the security of electronic medical records will be difficult to guarantee.

  • The hospital: When the hospital uploads data to the cloud server, errors may occur. For example, after uploading a tag, the hospital may upload incomplete data or other non-corresponding data, leading to inconsistency between the data and the tag. If the cloud server stores erroneous data, the data can never be rectified afterwards, since the cloud server performs deduplication and stores only one copy. As a result, the erroneous data seriously degrades data reliability and may mislead researchers.

4.2.2 External adversaries

We assume that external adversaries can obtain valuable information by monitoring the communication between the various entities. Based on this information, they try to recover the plaintext of the electronic medical records.

4.3 Design goals

In order to resist the above adversaries, we list the goals the proposed scheme should achieve.

  • Data confidentiality: Data confidentiality is a fundamental requirement of the deduplication scheme. Due to the particularity of electronic medical records, the cloud server and unauthorized researchers cannot obtain the plaintext of electronic medical records. Moreover, even authorized researchers cannot obtain the sensitive information of patients in electronic medical records.

  • Resistance to brute-force attacks and single-point-of-failure attack: Even if brute-force attacks are launched, any adversary cannot obtain valuable information. Moreover, if an adversary attacks a key server and extracts the server-side secret share from this key server, he/she also cannot recover the MLE key.

  • Data integrity: Both the cloud server and authorized researchers can check the integrity of data. Moreover, if the data downloaded by an authorized researcher is erroneous, the researcher can determine whether the error results from a duplicate faking attack or from corruption by the cloud server.

  • Downloading selectivity: The authorized researchers can download electronic medical records selectively according to duplicate ratio.

  • Efficiency: On the premise of ensuring security, the efficiency is our goal, including deduplication efficiency, computing efficiency and storage efficiency.

5 The proposed scheme

5.1 Overview

Electronic medical records contain disease-related information that is very helpful for researchers studying the disease, so it is necessary for the hospital to share electronic medical records with researchers. However, electronic medical records also contain the patient's sensitive information, which should not be exposed to researchers. To hide the patient's sensitive information, we design a secure data sharing scheme. Before uploading an electronic medical record, the hospital first blinds its sensitive information with wildcards. Then the hospital interacts with the key servers to generate the MLE keys. The hospital divides the electronic medical record into multiple data blocks and encrypts them with the corresponding MLE keys to obtain the ciphertexts; compared with file-level deduplication, such chunk-level deduplication has higher deduplication efficiency. The hospital derives a tag from the ciphertext of each data block, which is used for duplicate detection and data integrity verification, and sends the tag to the cloud server. The cloud server uses the tag to check whether the data block is duplicated; if not, the hospital uploads the data block. In this way, nobody can obtain the patient's sensitive information from the electronic medical records downloaded from the cloud server. After receiving the ciphertext of a data block, the cloud server performs deduplication to save storage space. Since the sensitive information of electronic medical records is uniformly blinded with wildcards, the duplicate ratio of the sensitive information is higher in our scheme.

In addition to performing deduplication, the cloud server calculates and updates the duplicate ratio of the diagnostic information of electronic medical records. We divide the diagnostic information into three categories according to duplicate ratio: high, intermediate and low. In this way, authorized researchers can download electronic medical records selectively according to duplicate ratio. For example, COVID-19 electronic medical records contain multiple clinical symptoms: some, like fever and cough, are very common, while others, like hypoglycemia, appear in only a few records. Symptoms with different duplicate ratios have different research value. If researchers want to study the common symptoms of COVID-19, they can choose to download only the electronic medical records with high duplicate ratio. On the one hand, this reduces the interference from records with low duplicate ratio; on the other hand, it reduces communication costs since researchers do not need to download all electronic medical records.

5.2 Construction of our scheme

5.2.1 System setup

  a) The system chooses an additive cyclic group \(G\) and a multiplicative cyclic group \({G}_{T}\) of prime order \(p\). \(P\) is a generator of \(G\), and \(e:G\times G\to {G}_{T}\) is a bilinear map. The system publishes the parameters \((G,{G}_{T},p,P,e)\).

  b) The hospital \(HP\) selects a collision-resistant hash function \(H: {\{\mathrm{0,1}\}}^{*}\to G\).

  c) The hospital \(HP\) randomly chooses a master key \(msk\).

  d) The hospital \(HP\) chooses a symmetric encryption algorithm \(E(\bullet )\) and a public-key encryption algorithm \(Enc(\bullet )\).

5.2.2 Sensitive information hiding

  a) After generating an electronic medical record \(F\), the hospital \(HP\) first replaces the patient's sensitive information in \(F\) with wildcards "*". We show an example of a processed electronic medical record in Fig. 2. We denote the resulting file by \({F}^{*}\).

  b) The hospital \(HP\) divides \({F}^{*}\) into multiple fixed-size blocks \(({m}_{1}^{*},{m}_{2}^{*},\dots ,{m}_{l}^{*})\), where \(l\) is the number of data blocks; a minimal sketch of these two steps follows Fig. 2.

Fig. 2 An example of the processed electronic medical record
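As a minimal sketch of steps a) and b), the following Python fragment blinds hypothetical sensitive fields with fixed-length wildcard runs and chunks the result into fixed-size blocks. The field names and block size are illustrative assumptions, not part of the scheme.

```python
WILDCARD = "*"

# Hypothetical field names; a real deployment would blind whatever sensitive
# fields the record schema defines (name, age, ID number, phone number, ...).
SENSITIVE_FIELDS = ("name", "age", "id_number", "phone")

def blind_sensitive(record: dict) -> dict:
    """Replace each sensitive field with a fixed-length run of wildcards, so
    that blinded records become byte-identical across patients (step a)."""
    blinded = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in blinded:
            blinded[field] = WILDCARD * 8  # uniform length raises the duplicate ratio
    return blinded

def chunk(data: bytes, size: int = 4096) -> list[bytes]:
    """Split the blinded record F* into fixed-size blocks m_1*, ..., m_l*
    (step b); the final block is zero-padded to the block size."""
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    if blocks and len(blocks[-1]) < size:
        blocks[-1] = blocks[-1].ljust(size, b"\x00")
    return blocks
```

Uniform wildcard lengths matter: they make blinded sensitive blocks from different patients byte-identical, which is what drives the higher duplicate ratio.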

5.2.3 Key servers initialization

  (a) Assume \(\{{KS}_{1},{KS}_{2},\dots ,{KS}_{n}\}\) is the set of \(n\) mutually independent key servers. We divide the total time into multiple fixed-length intervals called epochs.

  (b) For \(k=1,2,\dots ,n\), \({KS}_{k}\) randomly selects \(t\) elements \({a}_{k,0},{a}_{k,1},{a}_{k,2},\dots ,{a}_{k,t-1}\in {Z}_{p}^{*}\) and constructs a polynomial \({F}_{k}\left(x\right)={a}_{k,0}+{a}_{k,1}x+{a}_{k,2}{x}^{2}+\dots +{a}_{k,t-1}{x}^{t-1} \bmod p\).

  (c) \({KS}_{k}\) calculates \({F}_{k}(j)\) and sends \({F}_{k}(j)\) secretly to \({KS}_{j}\), for \(j=1,2,3,\dots ,n\) and \(j\ne k\). \({KS}_{k}\) also calculates the commitment set \(\{{a}_{k,0}P,{a}_{k,1}P,{a}_{k,2}P,\dots ,{a}_{k,t-1}P\}\) and publishes it to all other key servers.

  (d) After receiving \({F}_{k}\left(j\right)\), \({KS}_{j}\) verifies the validity of \({F}_{k}(j)\) by checking whether \({F}_{k}\left(j\right)P={\sum }_{\xi =0}^{t-1}{j}^{\xi }\cdot {a}_{k,\xi }P\) holds. If the equation holds, \({KS}_{j}\) accepts \({F}_{k}(j)\); otherwise it rejects \({F}_{k}(j)\).

  (e) After receiving \(\left(n-1\right)\) correct values \({F}_{k}\left(j\right)\) from \({KS}_{k}\left(k=1,2,3,\dots ,n;k\ne j\right)\), \({KS}_{j}\) calculates its secret share \({s}_{j}={F}_{1}\left(j\right)+{F}_{2}\left(j\right)+{F}_{3}\left(j\right)+\dots +{F}_{n}(j)\), where \({F}_{j}(j)\) is computed by \({KS}_{j}\) itself.

  (f) Let \(s={a}_{1,0}+{a}_{2,0}+{a}_{3,0}+\dots +{a}_{n,0}\). By Lagrange interpolation, \(s={\sum }_{\varsigma =1}^{t}{\omega }_{\varsigma }\cdot {s}_{\varsigma }\), where \({\omega }_{\varsigma }=\prod_{\begin{array}{c}1\le \eta \le t\\ \eta \ne \varsigma \end{array}}\frac{\eta }{\eta -\varsigma }\) (a minimal sketch of steps (b)-(f) follows this list).

  (g) \({KS}_{j}\) calculates the commitment \({Q}_{j}={s}_{j}\cdot P\) for its share \({s}_{j}\), and the public value is \(Q={\sum }_{\epsilon =1}^{n}{a}_{\epsilon ,0}\cdot P\).

  (h) Without loss of generality, assume that the key servers in the \(\mathcal{X}\)-th epoch are \({KS}^{(\mathcal{X})}=\left\{{KS}_{1}^{(\mathcal{X})},{KS}_{2}^{(\mathcal{X})},\dots ,{KS}_{n}^{(\mathcal{X})}\right\}\) and those in the \(\left(\mathcal{X}+1\right)\)-th epoch are \({KS}^{(\mathcal{X}+1)}=\{{KS}_{1}^{(\mathcal{X}+1)},{KS}_{2}^{(\mathcal{X}+1)},\dots ,{KS}_{n}^{(\mathcal{X}+1)}\}\). First, \(t\) honest and reliable key servers are selected from \({KS}^{(\mathcal{X})}\), denoted \(\{{KS}_{i1}^{(\mathcal{X})},{KS}_{i2}^{(\mathcal{X})},\dots ,{KS}_{it}^{(\mathcal{X})}\}\), with corresponding secret shares \(\{{s}_{i1}^{(\mathcal{X})},{s}_{i2}^{(\mathcal{X})},\dots ,{s}_{it}^{(\mathcal{X})}\}\). For \(\alpha =1,2,3,\dots ,t\), the commitment \({Q}_{i\alpha }^{(\mathcal{X})}={s}_{i\alpha }^{(\mathcal{X})}\cdot P\) has been published.

  (i) \({KS}_{i\alpha }^{(\mathcal{X})}\) randomly selects \(\left(t-1\right)\) elements \({b}_{i\alpha ,1},{b}_{i\alpha ,2},{b}_{i\alpha ,3},\dots ,{b}_{i\alpha ,t-1}\) and constructs the polynomial \({g}_{i\alpha }^{(\mathcal{X})}\left(x\right)={s}_{i\alpha }^{(\mathcal{X})}+{b}_{i\alpha ,1}\cdot x+{b}_{i\alpha ,2}\cdot {x}^{2}+\dots +{b}_{i\alpha ,t-1}\cdot {x}^{t-1} \bmod p\). \({KS}_{i\alpha }^{(\mathcal{X})}\) computes and publishes the commitments \({b}_{i\alpha ,1}\cdot P,{b}_{i\alpha ,2}\cdot P,\dots ,{b}_{i\alpha ,t-1}\cdot P\), and computes \({s}_{i\alpha ,\beta }^{(\mathcal{X})}={g}_{i\alpha }^{(\mathcal{X})}(\beta )\) for \(\beta =1,2,3,\dots ,n\).

  (j) \({KS}_{i\alpha }^{(\mathcal{X})}\) sends \({s}_{i\alpha ,\beta }^{(\mathcal{X})}\) to \({KS}_{\beta }^{(\mathcal{X}+1)}\) through a secure channel. \({KS}_{\beta }^{(\mathcal{X}+1)}\) then checks the validity of \({s}_{i\alpha ,\beta }^{(\mathcal{X})}\) by verifying whether \({s}_{i\alpha ,\beta }^{(\mathcal{X})}\cdot P={s}_{i\alpha }^{(\mathcal{X})}\cdot P+{\sum }_{\gamma =1}^{t-1}{\beta }^{\gamma }\cdot {b}_{i\alpha ,\gamma }\cdot P\) holds. If the check fails, \({KS}_{\beta }^{(\mathcal{X}+1)}\) aborts; otherwise, it sends "Accept" to all other key servers in the \(\left(\mathcal{X}+1\right)\)-th epoch.

  (k) After receiving "Accept" from all other key servers, \({KS}_{\beta }^{(\mathcal{X}+1)}\) has received all correct shares \({s}_{i1,\beta }^{(\mathcal{X})},{s}_{i2,\beta }^{(\mathcal{X})},\dots ,{s}_{it,\beta }^{(\mathcal{X})}\). It computes its secret share \({s}_{\beta }^{(\mathcal{X}+1)}={\sum }_{\varsigma =1}^{t}{\omega }_{i\varsigma }\cdot {s}_{i\varsigma ,\beta }^{(\mathcal{X})}\), where \({\omega }_{i\varsigma }={\prod }_{\begin{array}{c}1\le \eta \le t\\ \eta \ne i\varsigma \end{array}}\frac{\eta }{\eta -i\varsigma }\). Finally, \({KS}_{\beta }^{(\mathcal{X}+1)}\) computes and publishes the public commitment \({Q}_{\beta }^{(\mathcal{X}+1)}={s}_{\beta }^{(\mathcal{X}+1)}\cdot P\).
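The share-generation core of steps (b)-(f) can be sketched in Python over a toy prime field, as below. The curve-point commitments of steps (c), (d) and (g) and the refresh messaging of steps (h)-(k) are elided, and the modulus is an arbitrary stand-in for the group order \(p\).

```python
import random

P = 2**127 - 1  # a Mersenne prime standing in for the group order p (toy choice)

def poly_eval(coeffs, x, p=P):
    """Evaluate F(x) = a_0 + a_1*x + ... + a_{t-1}*x^{t-1} mod p."""
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

def dkg(n, t, p=P):
    """Steps (b)-(e): every key server picks a degree-(t-1) polynomial and
    sends F_k(j) to server j; server j's share is s_j = sum_k F_k(j).
    The jointly defined secret is s = sum_k a_{k,0}."""
    polys = [[random.randrange(1, p) for _ in range(t)] for _ in range(n)]
    shares = [sum(poly_eval(f, j, p) for f in polys) % p for j in range(1, n + 1)]
    secret = sum(f[0] for f in polys) % p
    return shares, secret

def lagrange_at_zero(xs, p=P):
    """Step (f): coefficients w_i = prod_{j != i} x_j / (x_j - x_i) mod p."""
    coeffs = []
    for xi in xs:
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = num * xj % p
                den = den * (xj - xi) % p
        coeffs.append(num * pow(den, -1, p) % p)
    return coeffs

n, t = 5, 3
shares, secret = dkg(n, t)
xs = [1, 2, 3]  # any t of the n key servers suffice
ws = lagrange_at_zero(xs)
assert secret == sum(w * shares[x - 1] for w, x in zip(ws, xs)) % P
```

The `lagrange_at_zero` coefficients are the \({\omega }_{\varsigma }\) of step (f); the share refresh of steps (h)-(k) reuses this machinery, with each old share serving as the constant term of a fresh polynomial.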

Remark 1

Although the key servers are replaced by new key servers at the end of each epoch and the secret shares stored in the new key servers change, the server-side secret \(s\) shared by the new key servers does not change. In the \(\mathcal{X}\)-th epoch, \(s={\sum }_{\varsigma =1}^{t}{\omega }_{\varsigma }\cdot {s}_{\varsigma }^{(\mathcal{X})}={\sum }_{\varsigma =1}^{t}{\omega }_{\varsigma }\cdot {\sum }_{\beta =1}^{t}{\omega }_{\beta }^{\prime}\cdot {s}_{\varsigma ,\beta }^{\left(\mathcal{X}\right)}={\sum }_{\beta =1}^{t}{\sum }_{\varsigma =1}^{t}{\omega }_{\varsigma }\cdot {\omega }_{\beta }^{\prime}\cdot {s}_{\varsigma ,\beta }^{(\mathcal{X})}={\sum }_{\beta =1}^{t}{\omega }_{\beta }^{\prime}\cdot {s}_{\beta }^{(\mathcal{X}+1)}\), where \({\omega }_{\varsigma }\) and \({\omega }_{\beta }^{\prime}\) are Lagrange coefficients. In the \((\mathcal{X}+1)\)-th epoch, \({s}^{\prime}={\sum }_{\beta =1}^{t}{\omega }_{\beta }^{\prime}\cdot {s}_{\beta }^{(\mathcal{X}+1)}\). Hence \(s={s}^{\prime}\), i.e., the server-side secret \(s\) remains unchanged.

5.2.4 MLE key generation

  (a) The hospital \(HP\) randomly selects an element \(r\in {Z}_{p}^{*}\) and computes \({m}_{i}^{\prime}=r\cdot H({m}_{i}^{*})\) for \(i=1,2,3,\dots ,l\). Then \(HP\) sends \({m}_{i}^{\prime}\) to all key servers.

  (b) For \(k=1,2,3,\dots ,n\), \({KS}_{k}\) computes \({\sigma }_{i}^{(k)}={s}_{k}\cdot {m}_{i}^{\prime}\) and returns \({\sigma }_{i}^{(k)}\) to \(HP\).

  (c) \(HP\) checks the validity of \({\sigma }_{i}^{(k)}\) by verifying whether \(e\left({\sigma }_{i}^{\left(k\right)},P\right)=e({m}_{i}^{\prime},{Q}_{k})\) holds; by bilinearity, a correct response satisfies \(e({\sigma }_{i}^{(k)},P)=e({s}_{k}\cdot {m}_{i}^{\prime},P)=e({m}_{i}^{\prime},{s}_{k}\cdot P)=e({m}_{i}^{\prime},{Q}_{k})\). If the equation does not hold, \(HP\) rejects \({\sigma }_{i}^{(k)}\).

  (d) After receiving \(t\) valid responses \(\{{\sigma }_{i}^{\left(i1\right)},{\sigma }_{i}^{\left(i2\right)},{\sigma }_{i}^{\left(i3\right)},\dots ,{\sigma }_{i}^{\left(it\right)}\}\), where \(i1,i2,i3,\dots ,it\in \left[1,n\right]\), \(HP\) computes \({\sigma }_{i}={r}^{-1}\cdot {\sum }_{\delta \in \{i1,\dots ,it\}}{\omega }_{\delta }\cdot {\sigma }_{i}^{(\delta )}=s\cdot H({m}_{i}^{*})\).

  (e) \(HP\) computes the MLE key of the data block \({m}_{i}^{*}\) as \({K}_{i}=H({\sigma }_{i})\) (a minimal sketch of steps (a)-(e) follows this list).
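The blinded request/response flow of steps (a)-(e) is sketched below over the same toy field as the previous sketch, so the pairing check of step (c) is elided; `h_to_group` is an illustrative stand-in for the hash \(H:{\{0,1\}}^{*}\to G\).

```python
import hashlib
import random

P = 2**127 - 1  # toy modulus standing in for the order of the group G

def h_to_group(data: bytes, p=P) -> int:
    """Stand-in for the hash H: {0,1}* -> G; we map into Z_p instead."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % p

def blind(block: bytes, p=P):
    """Step (a): HP blinds H(m*) with a fresh random r, hiding the block
    from the key servers."""
    r = random.randrange(1, p)
    return r, r * h_to_group(block) % p

def server_response(share: int, blinded: int, p=P) -> int:
    """Step (b): key server KS_k multiplies the blinded value by its secret
    share s_k. The pairing-based validity check of step (c) is elided."""
    return share * blinded % p

def unblind(r: int, responses, ws, p=P) -> int:
    """Step (d): Lagrange-combine t responses and strip the blinding factor,
    yielding sigma = s * H(m*)."""
    combined = sum(w * sig for w, sig in zip(ws, responses)) % p
    return pow(r, -1, p) * combined % p

def mle_key(sigma: int) -> bytes:
    """Step (e): the MLE key is K = H(sigma)."""
    return hashlib.sha256(sigma.to_bytes(16, "big")).digest()
```

Reusing `dkg` and `lagrange_at_zero` from the previous sketch, one can check that `unblind` returns `secret * h_to_group(block) % P` for any blinding factor `r`, so identical blocks always yield the same MLE key.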

Remark 2

Each hospital selects a different element \(r\) and therefore computes a different \({m}_{i}^{\prime}\), so a key server cannot tell from \({m}_{i}^{\prime}\) whether two data blocks \({m}_{i}^{*}\) are identical. Since \({\sigma }_{i}=s\cdot H({m}_{i}^{*})\), \({\sigma }_{i}\) depends only on the server-side secret \(s\) and the data block \({m}_{i}^{*}\). As \(s\) is constant, identical data blocks yield the same \({\sigma }_{i}\) at every hospital, and since \({K}_{i}=H({\sigma }_{i})\), every hospital obtains the same MLE key \({K}_{i}\). Each hospital thus encrypts the same data block and computes its tag with the same MLE key, producing the same ciphertext and the same tag. Therefore, the cloud server can perform deduplication on data uploaded by different hospitals.

5.2.5 Data upload and deduplication

After being uniformly blinded with wildcards, the sensitive information in electronic medical records has a high duplicate ratio, so its duplicate-ratio category carries no useful information for researchers. To exclude the influence of such meaningless categories, we design different upload processes for sensitive information and diagnostic information.

5.2.5.1 Sensitive information
  (a) After getting the MLE keys, \(HP\) first encrypts each data block \({m}_{i}^{*}\) of sensitive information as \({C}_{i}=E({K}_{i},{m}_{i}^{*})\). Then \(HP\) computes \({\tau }_{i}=H({C}_{i})\) as the tag of the data block \({m}_{i}^{*}\). Finally, \(HP\) sends these \({\tau }_{i}\) to the cloud server \(CS\).

  (b) \(CS\) maintains a set \(\left\{\tau \right\}\) for data blocks of sensitive information. After receiving the tag \({\tau }_{i}\), \(CS\) checks whether \({\tau }_{i}\) exists in the set \(\left\{\tau \right\}\). If \(CS\) finds the tag \({\tau }_{i}\), it aborts, since the block is a duplicate; otherwise, it proceeds to the next step.

  (c) \(HP\) calculates the ciphertext \({CK}_{i}=E(msk,{K}_{i})\) of the MLE key and uploads \(({C}_{i},{CK}_{i})\) to \(CS\).

  (d) After receiving \(({C}_{i},{CK}_{i})\), \(CS\) verifies the integrity of \({C}_{i}\) by checking \(H\left({C}_{i}\right)={\tau }_{i}\). If the verification passes, \(CS\) stores \(({C}_{i},{CK}_{i})\); otherwise, \(CS\) aborts.

5.2.5.2 Diagnostic information
  (a) After getting the MLE keys, \(HP\) first encrypts each data block \({m}_{i}^{*}\) of diagnostic information as \({C}_{i}=E({K}_{i},{m}_{i}^{*})\). Then \(HP\) computes \({\tau }_{i}=H({C}_{i})\) as the tag of the data block \({m}_{i}^{*}\). Finally, \(HP\) sends these \({\tau }_{i}\) to the cloud server \(CS\).

  (b) \(CS\) maintains a tuple \(T=\left(\tau ,\rho ,\lambda ,\varphi \right)\) for each unique data block of diagnostic information, where \(\tau \) is the tag of the data block, \(\rho \) is the number of duplicates of the data block, \(\lambda \) is the duplicate ratio of the data block over all files, and \(\varphi \) is the category of the data block. The data blocks are divided into three categories, low, intermediate and high duplicate ratio, represented by \(-1\), \(0\) and \(1\), respectively. \(CS\) maintains a value \(f\) recording the number of currently stored files and sets two thresholds \({t}^{\prime}\) and \({t}^{\prime\prime}\) to distinguish the categories of data blocks (a minimal sketch of this bookkeeping follows the list).

  (c) After receiving the tag \({\tau }_{i}\), \(CS\) checks whether \({\tau }_{i}\) exists in the tuple set \(\{T\}\). If \(CS\) does not find \({\tau }_{i}\), the protocol continues with step (d); otherwise, steps (d), (g) and (h) are skipped.

  (d) \(CS\) constructs a tuple \({T}_{i}=({\tau }_{i},{\rho }_{i},{\lambda }_{i},{\varphi }_{i})\) for \({\tau }_{i}\) and sets \({\rho }_{i}={\lambda }_{i}=0\) and \({\varphi }_{i}=-1\).

  (e) For each received tag \({\tau }_{i}\), \(CS\) sets \({\rho }_{i}={\rho }_{i}+1\). After receiving all the tags of a file, \(CS\) sets \(f=f+1\) and calculates the duplicate ratio of each data block as \({\lambda }_{i}=\frac{{\rho }_{i}}{f}\).

  (f) \(CS\) updates the categories based on the duplicate ratios of the data blocks: if \({\lambda }_{i}<{t}^{\prime}\), \(CS\) sets \({\varphi }_{i}=-1\); if \({t}^{\prime}\le {\lambda }_{i}<{t}^{\prime\prime}\), \(CS\) sets \({\varphi }_{i}=0\); if \({\lambda }_{i}\ge {t}^{\prime\prime}\), \(CS\) sets \({\varphi }_{i}=1\).

  (g) \(HP\) calculates the ciphertext \({CK}_{i}=E(msk,{K}_{i})\) of the MLE key and uploads \(({C}_{i},{CK}_{i})\) to \(CS\).

  (h) After receiving \(({C}_{i},{CK}_{i})\), \(CS\) checks the integrity of \({C}_{i}\) by verifying whether \(H\left({C}_{i}\right)={\tau }_{i}\) holds. If the verification passes, \(CS\) stores \(({C}_{i},{CK}_{i})\); otherwise, \(CS\) aborts.
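The following hedged sketch shows the cloud-server bookkeeping of steps (b)-(f); the threshold values and class names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

T_LOW, T_HIGH = 0.2, 0.6  # hypothetical thresholds t' and t''

@dataclass
class BlockTuple:
    """The per-block tuple T = (tau, rho, lambda, phi) of step (b)."""
    tau: bytes        # tag of the data block
    rho: int = 0      # number of duplicates seen so far
    lam: float = 0.0  # duplicate ratio rho / f ("lambda" is reserved in Python)
    phi: int = -1     # category: -1 low, 0 intermediate, 1 high

class DedupIndex:
    def __init__(self):
        self.tuples: dict[bytes, BlockTuple] = {}
        self.files = 0  # the counter f of currently stored files

    def receive_file_tags(self, tags):
        """Steps (c)-(f): count each tag, bump the file counter once per file,
        then recompute ratios and categories. Returns the tags the hospital
        still has to upload, i.e., the non-duplicates."""
        fresh = []
        for tau in tags:
            t = self.tuples.get(tau)
            if t is None:                      # step (d): new tuple
                t = self.tuples[tau] = BlockTuple(tau)
                fresh.append(tau)
            t.rho += 1                         # step (e)
        self.files += 1
        for t in self.tuples.values():         # steps (e)-(f)
            t.lam = t.rho / self.files
            t.phi = -1 if t.lam < T_LOW else (0 if t.lam < T_HIGH else 1)
        return fresh
```

Only the tags returned in `fresh` trigger steps (g) and (h); duplicate tags still update \(\rho \), \(\lambda \) and \(\varphi \), matching the skip rule of step (c).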

5.2.6 Authorization and data download

The hospital \(HP\) randomly selects an element \(x\in {Z}_{p}^{*}\), then calculates and publishes \(xP\). The researcher \(R\) randomly selects an element \(y\in {Z}_{p}^{*}\), then calculates and publishes \(yP\).

  (a) \(R\) interacts with \(HP\) to obtain authorization to access the electronic medical records. \(HP\) calculates \(HPR=msk\cdot x\cdot \left(yP\right)\) and sends \((HPR,\{\tau \})\) to \(R\).

  (b) \(R\) calculates \(msk=HPR\cdot {\left(y\cdot xP\right)}^{-1}\) and then sends \(\{\tau \}\) and \(\{\varphi \}\) to \(CS\), where \(\varphi \) is the data category. For example, if \(R\) wants to download the electronic medical records with high duplicate ratio, \(R\) sets \(\varphi =1\) and sends it to \(CS\).

  (c) After receiving \(\{\tau \}\) and \(\{\varphi \}\), \(CS\) first finds the tags \(\left\{\tau \right\}\) that match the categories \(\{\varphi \}\) according to the tuples \(T\). Then \(CS\) picks out the electronic medical records that contain these tags. Finally, \(CS\) returns to \(R\) the ciphertexts \(\{\left(C,CK\right)\}\) of the data blocks that make up these electronic medical records.

  (d) \(R\) verifies the integrity of each \(C\) by checking \(H\left(C\right)=\tau \). If the verification fails, \(R\) aborts. Otherwise, \(R\) calculates \(K=D(msk,CK)\) and \({m}^{*}=D(K,C)\); a minimal sketch of this step follows.
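Step (d) on the researcher's side amounts to a verify-then-decrypt guard, sketched below; `sym_decrypt` is a placeholder for the scheme's symmetric decryption \(D\), injected so the sketch stays independent of any concrete cipher.

```python
import hashlib

def verify_and_decrypt(tau: bytes, C: bytes, CK: bytes, msk: bytes,
                       sym_decrypt) -> bytes:
    """Step (d) of Sect. 5.2.6: check H(C) = tau before touching any key.
    sym_decrypt(key, ct) stands in for the scheme's symmetric decryption D."""
    if hashlib.sha256(C).digest() != tau:
        raise ValueError("tag mismatch: ciphertext is fake or corrupted")
    K = sym_decrypt(msk, CK)   # recover the MLE key from its encryption CK
    return sym_decrypt(K, C)   # recover the blinded block m*
```

A failed check means the returned \(C\) does not match the tag; since tag consistency was already verified at upload (Sect. 5.2.5), the researcher can attribute such an error to the cloud server.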

6 Security analysis

We analyze the security of our proposed scheme in terms of resistance to brute-force attacks and single-point-of-failure attack, data confidentiality and data integrity.

6.1 Resistance to brute-force attacks and single-point-of-failure attack

First, we show that an attacker cannot launch brute-force attacks to obtain the plaintext of the data. Specifically, given the ciphertext \({C}_{i}\) of a data block, the attacker wants to determine which candidate data block corresponds to \({C}_{i}\): for each candidate data block \({m}_{k}^{*}\), the attacker calculates its ciphertext \({C}_{k}\) and compares \({C}_{k}\) with \({C}_{i}\) to decide whether \({m}_{k}^{*}\) is the plaintext of \({C}_{i}\). However, as shown in Sect. 5.2.5, to encrypt the candidate data block \({m}_{k}^{*}\) the attacker must obtain the MLE key \({K}_{k}=H({\sigma }_{k})\), where \({\sigma }_{k}=s\cdot H({m}_{k}^{*})\) and \(s\) is the server-side secret held jointly by the key servers. Since the key servers are independent and the attacker cannot collude with them, the attacker cannot obtain \(s\), and hence cannot compute the MLE keys or the ciphertexts of the candidate data blocks. Therefore, the attacker cannot launch brute-force attacks.

Next, we show that our scheme removes the single-point-of-failure problem. To obtain the server-side secret \(s\), the attacker compromises a key server and extracts the secret share stored in it. As shown in Sect. 5.2.3, the secret \(s\) is stored across \(n\) key servers in the form of \((t,n)\)-threshold secret shares, and \(s\) can be recovered only from at least \(t\) shares. If the number of compromised key servers is less than \(t\), the attacker cannot gather enough information to recover \(s\). In addition, our scheme divides the whole time into segments of fixed duration, called epochs. At the end of each epoch, a batch of new key servers is selected to store the server-side secret \(s\); \(s\) itself does not change, but the secret shares stored in the new key servers do. In other words, at the end of each epoch, all secret shares the attacker has acquired from compromised key servers expire, which further enhances the security of \(s\). Therefore, our scheme removes the single-point-of-failure problem.

6.2 Data confidentiality

In this section, we show that no adversary, including the cloud service provider (CSP), unauthorized researchers and external adversaries, can obtain the plaintext of the data stored in the cloud server.

Theorem 1

For the CSP, the proposed scheme satisfies semantic security as long as the Discrete Logarithm (DL) assumption holds.

Proof

We define a polynomial-time adversary \(\mathcal{A}\) to simulate the corrupted semi-trusted CSP. Assume the adversary \(\mathcal{A}\) can obtain the plaintext from the ciphertext \(C\) with advantage \(\epsilon \left(\kappa \right)\), where \(\kappa \) is a security parameter. We construct an algorithm \(\mathcal{B}\) which breaks the DL assumption with the same advantage as \(\mathcal{A}\). The details of the security game are as follows:

Init: Let \((G,{G}_{T},p,P,e)\) be the tuple published by the system, where \(p\) is the prime order of \(G\) and \({G}_{T}\), and \(P\) is the generator of \(G\). \(\mathcal{B}\) randomly chooses a server-side secret \(s\in {\mathbb{Z}}_{p}^{*}\) and computes the public value \(Q=s\cdot P\in G\). \(\mathcal{B}\) generates a set \({G}_{s}=\{{s}_{1},{s}_{2},\dots \}\) of random elements of \({\mathbb{Z}}_{p}^{*}\), none of which satisfies \(Q={s}_{i}\cdot P\), so that \(s\notin {G}_{s}\). Given the parameter \(y\), where \(y=s\) or \(y\in {G}_{s}\), \(\mathcal{B}\) gives the parameters \((G,{G}_{T},p,P,e,Q,y)\) to \(\mathcal{A}\).

Challenge: \(\mathcal{A}\) constructs two data blocks \({m}_{0}, {m}_{1}\in G\) and sends them to \(\mathcal{B}\). \(\mathcal{B}\) selects a uniformly random bit \(b\in \{0,1\}\) and computes the encryption key \(K=H(s\cdot H({m}_{b}))\). Finally, \(\mathcal{B}\) outputs the ciphertext \({C}_{b}=E(K,{m}_{b})\).

Guess: \(\mathcal{A}\) outputs a guess \({b}^{\prime}\) of \(b\). If \({b}^{\prime}=b\), \(\mathcal{B}\) returns \(1\) to denote that \(y=s\); otherwise, \(\mathcal{B}\) returns \(0\) to denote that \(y\) is a random element from the set \({G}_{s}\).

Based on the assumption, \(\mathcal{A}\) can obtain the server-side secret \(s\) from the public value \(Q=s\cdot P\) with advantage \(\epsilon \left(\kappa \right)\) when \(y=s\), and then \(\mathcal{A}\) can compute the encryption key \(K=H(s\cdot H({m}_{b}))\) and obtain \({m}_{b}\) by computing \({m}_{b}=D(K,{C}_{b})\). Hence, \(\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}+\epsilon (\kappa )\). Since \(\mathcal{B}\) returns \(1\) only when the prediction \({b}^{\prime}\) of \(\mathcal{A}\) equals \(b\), we know \(\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,Q,y\right)=1:y=s\right]=\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}+\epsilon (\kappa )\).

When \(y\) is a random element from \({G}_{s}\), \(y\) is uniformly distributed in \({G}_{s}\) and is independent of \(b\). Therefore, \(\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}\), which indicates that \(\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,Q,y\right)=1:y\in {G}_{s}\right]=\frac{1}{2}\). So we obtain \(DL\text{-}Ad{v}_{\mathcal{B}}=|\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,Q,y\right)=1:y=s\right]-\mathrm{Pr}[\mathcal{B}\left(G,{G}_{T},P,Q,y\right)=1:y\in {G}_{s}]|=\left|\left(\frac{1}{2}+\epsilon \left(\kappa \right)\right)-\frac{1}{2}\right|=\epsilon (\kappa )\), which means the advantage with which the polynomial-time algorithm \(\mathcal{B}\) determines whether \(y=s\) is \(\epsilon (\kappa )\). Under the DL assumption, \(\epsilon (\kappa )\) is negligible, which completes the proof.

In our scheme, the hospital \(HP\) authorizes the researchers through the master key \(msk\). Hence, the only way for unauthorized researchers to access electronic medical records is to extract \(msk\) from \(HPR=msk\cdot x\cdot y\cdot P\). Next, we prove that our scheme is semantically secure for unauthorized researchers.

Theorem 2

For unauthorized researchers, the proposed scheme satisfies semantic security as long as the Computational Diffie-Hellman (CDH) assumption holds.

Proof

We define a polynomial-time adversary \(\mathcal{A}\) to simulate the corrupted unauthorized researchers. Assume the adversary \(\mathcal{A}\) can obtain the master key \(msk\) from the ciphertext \(HPR\) with advantage \(\epsilon \left(\kappa \right)\). We construct an algorithm \(\mathcal{B}\) which breaks the CDH assumption with the same advantage as \(\mathcal{A}\). The details of the security game are as follows:

Init: \(\mathcal{B}\) generates the parameters \(\left(P,x\cdot P,y\cdot P,W\right)\), where \(P\) is the generator of \(G\), \(x,y\in {\mathbb{Z}}_{p}^{*}\) are unknown, and \(x\cdot P, y\cdot P, W\in G\). \(\mathcal{B}\) gives the public parameters \((G,{G}_{T},p,P,e,x\cdot P,y\cdot P)\) to \(\mathcal{A}\).

Challenge: \(\mathcal{A}\) constructs two data blocks \({m}_{0},{m}_{1}\in G\) and sends them to \(\mathcal{B}\). \(\mathcal{B}\) selects a uniformly random bit \(b\in \{0,1\}\) and returns the ciphertext \(HPR={m}_{b}\cdot W\).

Guess: \(\mathcal{A}\) outputs a guess \({b}^{\prime}\) of \(b\). If \({b}^{\prime}=b\), \(\mathcal{B}\) returns \(1\) to denote that \(W=x\cdot y\cdot P\); otherwise, \(\mathcal{B}\) returns \(0\) to denote that \(W\) is a random element from \(G\).

Based on the assumption, \(\mathcal{A}\) can obtain \(W\) from the tuple \((P,x\cdot P,y\cdot P)\) with advantage \(\epsilon (\kappa )\) if \(W=x\cdot y\cdot P\). Then \(\mathcal{A}\) can obtain \({m}_{b}\) with \(W\) by computing \({m}_{b}=HPR\cdot {W}^{-1}\). Hence, we have \(\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}+\epsilon (\kappa )\). Since \(\mathcal{B}\) returns \(1\) only when the prediction \({b}^{\prime}\) of \(\mathcal{A}\) is equal to \(b\), we know \(\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,x\cdot P,y\cdot P,W\right)=1\right]=\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}+\epsilon (\kappa )\).

When \(W\) is a random element from \(G\), \({m}_{b}\cdot W\) is uniformly distributed and independent of \(b\) from \(\mathcal{A}\)'s view. Therefore, \(\mathrm{Pr}\left[\mathcal{A}\left({b}^{\prime}=b\right)\right]=\frac{1}{2}\), which indicates that \(\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,x\cdot P,y\cdot P,W\right)=1:W\ne x\cdot y\cdot P\right]=\frac{1}{2}\). So we obtain \(CDH\text{-}Ad{v}_{\mathcal{B}}=\left|\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,x\cdot P,y\cdot P,W\right)=1:W=x\cdot y\cdot P\right]-\mathrm{Pr}\left[\mathcal{B}\left(G,{G}_{T},P,x\cdot P,y\cdot P,W\right)=1:W\ne x\cdot y\cdot P\right]\right|=\left|\left(\frac{1}{2}+\epsilon \left(\kappa \right)\right)-\frac{1}{2}\right|=\epsilon \left(\kappa \right)\), which implies that the advantage of the polynomial-time algorithm \(\mathcal{B}\) in determining whether \(W=x\cdot y\cdot P\) is \(\epsilon (\kappa )\). By the CDH assumption described in Sect. 3.4, \(\epsilon (\kappa )\) is negligible.

Theorem 3

For external adversaries, the proposed scheme satisfies semantic security as long as the DL and CDH assumptions hold.

Proof

External adversaries possess less relevant information than internal adversaries. Hence, since internal adversaries cannot obtain the plaintext of the outsourced data (Theorems 1 and 2), external adversaries cannot obtain it either.

6.3 Data integrity

We now show that our scheme ensures data integrity, i.e., the integrity of both uploaded and downloaded data. Specifically, assume that the hospital \(HP\) uploads the tag \({\tau }_{i}\) of data block \({m}_{i}^{*}\), the cloud server finds no duplicate and asks \(HP\) to upload the data, but \(HP\) intentionally or unintentionally uploads the ciphertext \({C}_{k}\) of another data block \({m}_{k}^{*}\). If the cloud server does not check data integrity, it stores the wrong data \({C}_{k}\). When the tag \({\tau }_{i}\) of data block \({m}_{i}^{*}\) is uploaded again, the cloud server finds a duplicate and does not request the data, so the wrong \({C}_{k}\) stored in the cloud server can never be corrected. When a researcher \(R\) downloads what should be \({C}_{i}\), he/she actually obtains the wrong data \({C}_{k}\) and cannot decrypt \({m}_{i}^{*}\) from it. Similarly, even if \(HP\) uploads \({C}_{i}\) correctly, the cloud server might return the ciphertext \({C}_{k}\) of another data block \({m}_{k}^{*}\) to \(R\), who again cannot decrypt \({m}_{i}^{*}\). Fortunately, our scheme detects such inconsistent data. As shown in Sect. 5.2.5, before uploading the data block \({m}_{i}^{*}\), \(HP\) uploads its tag \({\tau }_{i}=H({C}_{i})\) to the cloud server. When the cloud server receives a ciphertext \({C}_{k}\), it checks whether \({\tau }_{i}=H({C}_{k})\) holds; if \({\tau }_{i}\ne H({C}_{k})\), then \({C}_{k}\ne {C}_{i}\), indicating that the data is inconsistent with the tag. Similarly, in the data sharing process, \(HP\) sends the tag \({\tau }_{i}\) to the researcher \(R\), who checks whether \({\tau }_{i}=H({C}_{k})\) holds after downloading \({C}_{k}\) from the cloud server; if \({\tau }_{i}\ne H({C}_{k})\), the data is corrupted.

7 Performance evaluation

To demonstrate the efficiency of our scheme, we conduct experiments on a real-world dataset [39]. This dataset contains about 300 electronic medical records, which are divided into 6300 data blocks. Note that the electronic medical records in the dataset exclude all of the patients' sensitive information; we pad sensitive information into these records to form intact electronic medical records, with the sensitive information accounting for about 38% of each whole record. The experiments are implemented in Java on a desktop computer running Windows with a 2.10 GHz Intel Core i5 CPU and 8 GB of memory. The security level is set to 64 bits. To show the efficiency of our scheme intuitively, we compare it with Zhang et al.'s scheme [26] and Bellare et al.'s scheme [21] in terms of deduplication efficiency, storage costs, computational costs and computation delay. Note that we omit evaluation of the data download and data sharing phases, since deduplication happens only in the data upload phase.

7.1 Deduplication efficiency

First, we compare our scheme with the DupLESS scheme [21] and the HealthDep scheme [26] with regard to deduplication efficiency. As shown in Fig. 3, our scheme achieves the best deduplication efficiency of the three, followed by DupLESS, with HealthDep the worst. In HealthDep, because of the low duplicate ratio of the patient's sensitive information in electronic medical records, the cloud server performs deduplication on everything except the sensitive information; the duplicate data within the sensitive information is never removed, so HealthDep's deduplication efficiency is the lowest. DupLESS does not consider protecting patients' sensitive information: it treats sensitive information like any other information and performs deduplication directly on the whole electronic medical record. Duplicate data within the sensitive information is removed, so DupLESS's deduplication efficiency is slightly higher than HealthDep's. In our scheme, before uploading electronic medical records, the hospital first replaces patients' sensitive information with wildcards of the same length, and the cloud server then performs deduplication. The duplicate ratio of the sensitive information therefore increases remarkably, and so does the deduplication efficiency.

Fig. 3 Efficiency of deduplication

7.2 Storage efficiency

Figure 4 shows the comparison of storage efficiency. The more data blocks the cloud server stores, the lower the storage efficiency. Figure 4 shows clearly that the storage efficiency of the cloud server is influenced by the deduplication efficiency and the number of files, and that our scheme achieves the best storage efficiency of the three schemes. Moreover, as the number of files increases, the storage-efficiency gap among the three schemes grows larger and larger. Since our scheme has the highest deduplication efficiency, it also has the highest storage efficiency. Both HealthDep and our scheme introduce \(n\) key servers to share a constant server-side secret, as in [12]. This secret sharing improves the security of the server-side secret but requires additional storage space for the \(n\) secret shares. Fortunately, this additional space is insignificant. On the one hand, each secret share is derived from the server-side secret \(s\), so its size is much smaller than that of the data. On the other hand, the \(n\) secret shares are stored in the key servers rather than in the hospital or the cloud server, so the secret sharing does not increase the storage overhead of the hospital or the cloud server.

Fig. 4 Comparison of storage costs during the data upload phase

7.3 Computational costs

We provide a comparison of computational costs in Fig. 5. We randomly select 10, 150 and 300 files from the dataset for the experiments. When the total number of files is relatively small, the computational time of the three schemes is as shown in Fig. 5a. When the number of files is 0, i.e., during system setup only, the computational time of DupLESS is the shortest, followed by HealthDep and our scheme. The main reason is that HealthDep and our scheme use multiple key servers to protect the server-side secret: during system setup, the \(n\) key servers interact with each other to generate secret shares, which consumes some time, whereas DupLESS employs a single key server and saves this time. As shown in Fig. 5b, as the total number of files increases, the computational costs of DupLESS become the highest, followed by HealthDep and our scheme. The main factor is that HealthDep and our scheme secure the server-side secret through interaction among multiple key servers, while DupLESS uses a single key server and secures the server-side secret with the RSA-OPRF protocol, which involves modular exponentiations in a multiplicative cyclic group. Compared with the hash computations used by HealthDep and our scheme, the RSA-OPRF protocol consumes more time, so DupLESS's computational costs are the highest. In Fig. 5b, the computational time of HealthDep is similar to that of our scheme; in Fig. 5a and c, the computational time of our scheme is slightly longer than HealthDep's, mainly because our scheme takes extra time to blind the sensitive information. Although HealthDep's computational cost is slightly better than ours, HealthDep needs to store more data blocks; in contrast, our scheme keeps the computational cost reasonable while ensuring storage efficiency.

Fig. 5 Comparison of computational costs during the data upload phase

7.4 Computation delay

In Fig. 5a and c, we observe that the computational time of our scheme is slightly higher than that of the HealthDep scheme. To explore the influencing factors, Fig. 6 compares the time users need to generate MLE keys in the three schemes; we define this time as the computation delay. As shown in Fig. 6, the computation delay of DupLESS is the highest, followed by our scheme, with HealthDep the lowest. As analyzed in Sect. 7.3, DupLESS spends a lot of time on key generation, so its computation delay is much higher than that of HealthDep and our scheme. The main reason why HealthDep's computation delay is lower than ours lies in how the sensitive information is handled: in HealthDep, the hospital randomly selects a key to encrypt the sensitive information, whereas in our scheme the hospital first replaces the sensitive information with wildcards and then interacts with the key servers to produce the corresponding MLE keys. Besides the time for blinding sensitive information, our scheme thus also spends time interacting with the key servers. Although HealthDep reduces computation delay, it also reduces the deduplication and storage efficiency of the cloud server; in contrast, our scheme achieves low computation delay while ensuring high deduplication efficiency.

Fig. 6 Comparison of computation delay during the data upload phase

In conclusion, compared with the schemes of [21, 26], our scheme shows good efficiency in terms of data deduplication, storage costs, computational costs and computation delay.

8 Conclusion

In this paper, we propose a secure data sharing scheme with data deduplication and sensitive information hiding. Our scheme adopts multiple key servers to resist brute-force attacks and single-point-of-failure attack. We replace the sensitive information of electronic medical records with wildcards, which protects the privacy of the sensitive information and also improves deduplication efficiency. We analyze the characteristics of electronic medical records of the same disease and divide data blocks into categories based on their duplicate ratios, which helps researchers choose data according to duplicate ratio. Performance evaluation shows that our scheme is indeed efficient in terms of data deduplication, storage costs, computational costs and computation delay.