
1 Introduction

With the rapid development of cloud computing, the cloud offers flexible and affordable software, platforms, infrastructure, and storage to organizations across all industries. For organizations faced with limited budgets and growing demands, cloud computing presents an opportunity to reduce costs, increase flexibility, and improve IT capability [14]. Despite the rapid adoption of cloud computing, security and privacy remain key issues for the security community [20, 25]. Although cloud service providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) continue to expand security services to protect their evolving cloud platforms, security and privacy remain lasting considerations when migrating traditional IT to the cloud [10].

Cyberspace is far from peaceful: both external advanced persistent threats (APTs) and insider attacks occur regularly. Since cloud environments often host a variety of tenants and their vast amounts of valuable data, cloud platforms are prime targets for cyber threat actors [32,33,34]. For external APTs, attackers are always looking for new attack surfaces in the cloud to bypass existing security controls [41]. Insider attacks are often carried out by disgruntled employees who hold limited authorized access and intentionally exfiltrate sensitive data or escalate privileges. Because of the human factor, this is an ethical issue as much as a technical one.

When enterprises migrate to the cloud, they lose physical control of systems and data, so an assessment method is required to evaluate the protection level of cloud environments, including cloud data. Although many standards, frameworks, and best practices have been proposed by the security community and industry, there is hardly a comprehensive evaluation model that quantitatively scores the protection level of cloud data while fully considering security, privacy, and ethical issues. In summary, this paper makes the following contributions:

  • To the best of our knowledge, we are the first to comprehensively review cloud data protection across security, privacy, and ethics, covering technological mechanisms, operational policies, and legal & regulatory compliance.

  • We present a novel algorithm to compute the score of protection level based on our insight about important factors that affect cloud data protection, that is, intra-phase, inter-phase, lifecycle operations, and compliance.

  • We propose an empirical evaluation model to assess the overall protection level of cloud data based on the score of each factor that affects the protection level.

2 Overview of Cloud Data Protection

This section introduces the methodology related to cloud data protection. We leverage the Data States Model and the Cloud Data Lifecycle Model to summarize major security and privacy controls from a top-level perspective. Fine-grained controls, including technical measures, operational policies, and legal & regulatory compliance, are discussed in the rest of the paper.

2.1 Data States Model

Data, as a type of critical asset, exists in one of three states both on-premises and in the cloud: at rest, in transit, and in use [26]. Regardless of the state of the data, IT systems should implement appropriate controls to protect it and mitigate security and privacy risks [43]. Figure 1 shows the Data States Model, including the three states of data and the transformations between them. Data in use can be converted to both the in-transit and at-rest states; however, data at rest cannot be changed to the in-transit state directly, and vice versa. It is worth noting that this characteristic of conversion between data states depends on the classical Von Neumann architecture, which remains dominant worldwide; other computing architectures, e.g. quantum computing, are out of our scope.

Fig. 1. The data states model in IT systems.

Data in Use refers to any data in main memory or other caches while an application is using it. Due to the multitasking and concurrent features of modern information systems, it is important to ensure authorized access to data in use. Operating System (OS) built-in process isolation and application-level sandboxing are the primary controls for data in memory and caches. However, emerging attack vectors often try to bypass existing security mechanisms through vulnerability exploitation or advanced impersonation techniques. To this end, several research efforts attempt to leverage homomorphic encryption [1], which limits the risk of data leakage because memory never holds unencrypted data.
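As an illustration of this direction, the following sketch uses the python-paillier package (phe), a partially (additively) homomorphic scheme, to compute on values without exposing plaintext to the processing component; the package choice and the values are only assumptions for the example.

```python
# A minimal sketch of computing on encrypted data in use, assuming the
# python-paillier package ("phe") is available; values are illustrative.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Sensitive values are encrypted before processing, so the processing
# component never holds them in plaintext.
enc_a = public_key.encrypt(4200)
enc_b = public_key.encrypt(3800)

# Paillier is additively homomorphic, so addition works directly on ciphertexts.
enc_total = enc_a + enc_b

# Only the holder of the private key can recover the result.
print(private_key.decrypt(enc_total))  # 8000
```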

Data at Rest, aka data on storage, is any data stored on media, such as hard drives, external USB drives, network attached storage (NAS), and storage area networks (SAN). The major risks it faces include data exfiltration, integrity breaches, and unavailability (e.g., Denial of Service, DoS). Strong symmetric encryption is the key control for data at rest with respect to security and privacy concerns. In addition, as a compensating control, data redundancy can improve the high availability (HA) of data. Furthermore, strict authentication and authorization controls [30] can also help prevent unauthorized access.

Data in Transit, also called data in motion, refers to any data transmitted over a network. The exchange of data between information infrastructures depends almost entirely on transmission networks in cyberspace. In particular, unlike the traditional on-premises model using internal local networks, when enterprises migrate their IT systems to the cloud, all data access is transferred over the Internet. Therefore, data in transit is more likely to be the target of cyber attacks than the other two data states. A combination of symmetric and asymmetric encryption generally protects data in transit.
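As a concrete example of such hybrid protection, TLS uses asymmetric cryptography for key exchange and symmetric ciphers for bulk data. The minimal client sketch below relies only on the Python standard library; the hostname is a placeholder.

```python
# A minimal sketch of protecting data in transit with TLS; example.com is a placeholder.
import socket
import ssl

context = ssl.create_default_context()  # verifies the server certificate chain by default

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())  # negotiated protocol, e.g. TLSv1.3
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(tls.recv(256))  # response bytes travel over the encrypted channel
```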

2.2 Cloud Data Lifecycle

Data in the cloud is constantly being created, stored, used, and transmitted, and once the data is no longer valuable, it needs to be destroyed. Unlike other valuable physical assets, the value of data is time-sensitive and context-specific, which makes data protection complex. The Cloud Data Lifecycle Model provides a generic approach to identifying the broad categories of risks facing the data and the associated security or privacy controls; it therefore allows us to consider the threats, vulnerabilities, and risks of cloud data at a higher level of abstraction without getting bogged down in the concrete details of a specific organization.

Table 1 illustrates each phase of this model and the corresponding representative controls. Note that the cloud data lifecycle is not strictly linear or iterative; data sometimes even exists in multiple phases simultaneously. For example, data being shared may be used and stored at the same time if co-workers collaborate in a Software as a Service (SaaS) app. Furthermore, data in a given phase can also exist in multiple states. Regardless, data should be protected at every stage with security and privacy controls commensurate with its value [18].

Table 1. Cloud data lifecycle and representative controls

  • Create. Data can be created by a user on on-premises or legacy workstations and then transferred to the cloud, or created directly in the cloud. Data classification [13] is the most important security control at this stage. In addition, if the data is created remotely, transport security mechanisms such as IPSec/TLS VPN [22] are necessary.

  • Store. Typically, this phase is synchronized with the creation phase, and encryption is introduced to mitigate threats exposed in the data center of the cloud environment. Meanwhile, for any cryptosystem, key management is also a critical security control.

  • Use. Unlike on-premises data access, accessing cloud data remotely requires additional security and privacy controls, including: (1) implementing virtualization protection controls [44] to ensure that there is no unauthorized access between different guests hosted on the same server; (2) leveraging Digital Rights Management (DRM), aka Information Rights Management (IRM), solutions [40] to implement fine-grained dynamic access control; (3) conducting continuous monitoring via Data Loss Prevention (DLP) solutions [21], which are also used for data classification in the create phase; and (4) enabling secure network transmission mechanisms, such as the aforementioned IPSec/TLS VPN.

  • Share. This phase is relatively self-explanatory: data is granted by its owner to other users or entities that require access. If the data has been properly classified, more accurate and fine-grained access control rules can be provided for data sharing. Otherwise, additional controls are required; many of the controls implemented in prior phases remain effective here, such as DRM/IRM solutions, IPSec/TLS VPN, and so forth. Furthermore, because cloud data centers are distributed, data can be located in data centers in different jurisdictions, so several restrictions may apply in accordance with regulatory mandates based on the location of the data center.

  • Archive. As an integral part of the cloud data lifecycle, archived data is used for: (1) Business Continuity/Disaster Recovery (BC/DR) [4, 29]; (2) data retention and audit [12]; (3) eDiscovery [28]; and (4) other compliance requirements. Similar to the storage phase, the primary security control in this phase is encryption. In addition, because archives are retained for long timeframes, there may be additional concerns related to availability.

  • Destroy. Data that is no longer useful and no longer subject to retention requirements should be securely destroyed. There are many options for data destruction in legacy or on-premises IT environments, e.g. deletion, overwriting, degaussing, etc. However, in the cloud environment, due to data dispersion techniques, the only viable choice is crypto shredding, aka cryptographic erasure [38]. This mechanism refers to encrypting the data with one strong encryption engine, then encrypting the resulting key with a second, different encryption engine, and destroying that key thereafter, as sketched below.
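The sketch below illustrates crypto shredding with the cryptography package's Fernet recipe standing in for the two encryption engines; it is an illustration of the idea under that assumption, not a production data destruction procedure.

```python
# A minimal crypto-shredding sketch using Fernet as a stand-in for both engines.
from cryptography.fernet import Fernet

data = b"customer records that must remain destroyable"

# Engine 1: encrypt the data itself with a data key.
data_key = Fernet.generate_key()
ciphertext = Fernet(data_key).encrypt(data)

# Engine 2: encrypt (wrap) the data key with a separate key-encryption key.
kek = Fernet.generate_key()
wrapped_data_key = Fernet(kek).encrypt(data_key)

# Only the ciphertext and the wrapped key are stored in the cloud.
# "Destroying" the data means discarding the key-encryption key (and the
# plaintext data key); the dispersed ciphertext copies then become unrecoverable.
del data_key, kek
```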

2.3 Shared Responsibility Model

Cloud computing is a business-driven rather than technology-driven computing model, so the interests of cloud service providers (CSPs) and cloud service customers (CSCs) are not always aligned. CSCs want maximum computing capability at the lowest cost; CSPs, on the other hand, want to provide as few services as possible while maximizing profits. We do not review the cloud computing reference model here, which is clearly defined in ISO/IEC 17789 [24]. Fortunately, despite this adversarial relationship, the security and privacy interests of both sides converge. For example, a data breach of a CSC caused by vulnerabilities in a CSP's infrastructure causes both parties to suffer brand and reputation damage, lower profits, and even ongoing lawsuits.

The Cloud Shared Responsibility Model [16] clarifies the responsibilities of both CSPs and CSCs for the defense-in-depth of cloud architecture. Table 2 shows the details of this model, where the rows and columns represent the layers of the cloud architecture and the cloud service models, respectively. In Table 2, cells marked C, S, and P indicate responsibility of the CSC, shared responsibility (both), and responsibility of the CSP, respectively. Although we only list the three most common cloud service models here, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), which are also defined in ISO/IEC 17789 [24], other service models have similar shared responsibility models. We particularly note that, regardless of the service model, the responsibility for data access security is attributed to the CSC. This means that the ultimate responsibility for any data breach is borne by the CSC, although the CSC also has the right to seek compensation from the CSP. Following the elementary principle of "layered defenses" in information security, both CSCs and CSPs need to implement security and privacy controls at different layers to protect data. In a nutshell, this paper does not attempt to distinguish which party is responsible for each implemented security control, as this does not influence our evaluation of the security, privacy, and ethics of the controls.

Table 2. The shared responsibility model

3 Techniques, Operations, and Compliance

Industry practice over the past decade shows that a large body of excellent approaches, mechanisms, and tools from traditional IT has been introduced to build the foundation of cloud computing. Therefore, the cloud and traditional IT share most of the security controls used to secure systems and data. These controls usually span three aspects, namely techniques, operational activities, and compliance. We discuss security and privacy controls for cloud data protection, along with the possible risks and how to mitigate them. Furthermore, ethical considerations are also discussed; in essence they can be treated as a risk as well, and refer to acting honorably, honestly, justly, and legally, with due diligence and due care.

The rest of this section discusses cloud-specific key security and privacy controls involving three aspects, i.e., technical mechanisms, operational policies, and legal compliance.

3.1 Technological Mechanisms

Unlike on-premises environments, the cloud can rarely protect data by implementing strong access controls at clear boundaries, so encryption is the primary option for protecting data. Cloud computing is usually multi-tenant, and even under a private cloud deployment model there are conflicts of interest among different departments within the same organization. Therefore, data obfuscation mechanisms are important security and privacy controls. In addition, virtualization techniques and the corresponding security controls, as elementary cloud infrastructure, face several cloud-specific risks.

1) Encryption and Key Management

It should come as no surprise that cloud computing depends deeply on encryption: no matter what state the data is in, without encryption it is impossible to use cloud computing in any secure way. Due to the criticality of encryption, organizations should concentrate their efforts on correctly implementing and deploying cryptographic systems, and key management is the area of greatest concern. If an organization uses multiple CSPs or intends to keep physical control over cryptographic keys, one solution is to escrow keys within the organization, but this requires additional infrastructure and personnel. Another way is to escrow keys with a third party, such as the prevailing Cloud Access Security Broker (CASB) [3], a service that provides key management and unified cloud data access control.

Despite all the efforts to encrypt data in the cloud, there are still risks that force a balance. First, encryption can be applied at different layers and granularities, such as volume-level, object-level, file-level, application-level, and so forth [42]. For performance reasons, it is difficult to implement strong encryption at all layers. For example, even with volume-level encryption on a volume attached to a virtual machine (VM) instance, the data is still vulnerable if an attacker gains access to the VM instance. Second, IT administrators or security staff may need to access other personnel's cryptographic keys for key recovery or other reasons; if a disgruntled employee obtains a key, the risk of unauthorized access increases. This is also an ethical issue. Third, although it is not a good practice, CSCs, for technical or budgetary reasons, sometimes escrow keys with the same CSP that stores the organization's data. This dependency risk is also termed Lock-In and is discussed in Sect. 3.2. Finally, due to legal and regulatory requirements for specific encryption algorithms or methods, there is a security gap between different jurisdictions when data crosses borders. This issue is detailed in Sect. 3.3.

Fig. 2. Overview of tokenization mechanism.

2) Data Obfuscation and De-identification

Concerning security and privacy, practical cloud data protection needs to obscure sensitive data or use a representation of that data instead. Masking is an elementary data hiding technique (e.g., showing only the last four digits of a credit card number); similar techniques include randomization, which replaces part of the data with random characters, and shuffling, which represents the data with different records from the same dataset. Tokenization is another privacy protection technique, illustrated in Fig. 2: a nonsensitive tag called a token is created as a substitute to be used in place of sensitive data. The implementation of tokenization typically consists of two databases, one storing the actual sensitive data and the other storing the tokens corresponding to each data entry. A user who needs to access data first obtains the nonsensitive tokens, and then a strong access control mechanism such as Identity and Access Management (IAM) [15] decides whether this user may access the corresponding sensitive data entries. Anonymization is the primary technique for de-identification when the data contains Personally Identifiable Information (PII). This process includes removing direct identifiers, e.g. names and bank accounts, and indirect identifiers, which are often statistical or demographic information but can be combined to infer PII, e.g. personal age and shopping history [6, 8, 31, 37].
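The following sketch mirrors the two-database design of Fig. 2: a token vault holds the sensitive values while applications only handle tokens, and an authorization check guards detokenization. The in-memory vault and the boolean authorization flag are simplified stand-ins for a real database and an IAM decision point.

```python
# A minimal tokenization sketch; the in-memory vault stands in for the protected database.
import secrets

token_vault = {}  # token -> sensitive value

def tokenize(sensitive_value: str) -> str:
    token = secrets.token_urlsafe(16)       # nonsensitive surrogate for the value
    token_vault[token] = sensitive_value    # the real value only lives in the vault
    return token

def detokenize(token: str, authorized: bool) -> str:
    # In practice an IAM/policy decision point makes this call, not a boolean flag.
    if not authorized:
        raise PermissionError("access to the sensitive value denied")
    return token_vault[token]

card_token = tokenize("4111 1111 1111 1111")
print(card_token)                    # safe to store or log in application systems
print(detokenize(card_token, True))  # only authorized requests reach the real value
```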

There are three major risks in implementing the aforementioned security and privacy controls. First, the above techniques perform well on structured data but may struggle with unstructured data that can reside on any media. Although DLP and continuous monitoring are available solutions, they are still not enough to address the challenge of sensitive data discovery. Second, the tokenization technique depends on the access control mechanism, so we inherit all of its risks, and the human factor is always the most significant among them; this is also an ethical dilemma. Last, although most privacy regulations require data anonymization or de-identification for any use of PII outside of live production environments, identifying indirect identifiers remains a hard nut to crack, because there are no effective rules for estimating whether a seemingly innocuous piece of information can be combined with other information to infer PII.

3) Virtualization

Virtualization technology is the cornerstone of cloud computing and enables its widely acclaimed on-demand services and resource pooling. In a sense, virtualization is itself a security control that achieves access control through isolation at diverse layers. Despite all the convenience, several risks must be considered when protecting cloud data with virtualization in practice. First, the hypervisor that manages VM instances is the critical component of the virtualization solution and therefore tends to be attacked. Compromising a VM instance only results in a data breach within that VM guest, so threat actors may instead attempt to compromise the hypervisor. Because the hypervisor acts as the interface and controller between the virtualized instances and the host resources, exploiting it can affect the security of all VM guests [7]. Another risk is guest escape: weakly designed or configured VM instances or hypervisors may allow users to break restrictions and leave their own VM instances to gain unauthorized access. There are two forms of guest escape: lateral movement, i.e. unauthorized access from one VM guest to another, and vertical movement, i.e. obtaining host machine permissions from within a VM guest. The second form is more harmful to cloud data protection. Finally, since the cloud environment is multi-tenant, we have to deal with data seizure issues. Legal activity may result in the seizure or inspection by law enforcement agencies or plaintiff attorneys of a host machine carrying hundreds of VM instances belonging to different CSCs, even if the organization is not the target. Great efforts still need to be made by both the security community and the judicial community to cope with this problem [39].

3.2 Operational Policies

While technical controls lay the foundation for mitigating cloud risks, security operations in the cloud provide ongoing security and privacy assurance. This section discusses several key controls in security operations. Due to space constraints, we do not cover every detail but focus on the policy aspects of security operations.

1) Data Classification

Data identification and classification are the foundation of cloud data protection. All implemented technical and administrative controls determine the level of protection based on the classification of data. Since an organization constantly creates data in its operations, this is an operational process, aka "Data Discovery" [17]. Typical approaches to data discovery include label-based, metadata-based, and content-based ones. Whether the data is created on-premises or in the cloud, a common assistant tool for data classification is DLP, a technology system designed to identify, inventory, and control the use of data that an organization deems sensitive, even when it is employees' personal data, such as web browsing history or pending resignation letters. In a nutshell, data discovery can sometimes be a "double-edged sword", raising privacy and ethical concerns.
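As a toy illustration of content-based data discovery, the sketch below scans text for a few sensitive patterns; the patterns and labels are illustrative and far simpler than a real DLP rule set.

```python
# A minimal content-based classification sketch; patterns are illustrative only.
import re

PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> set:
    """Return the sensitivity labels whose patterns appear in the text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111"
print(classify(sample))  # {'credit_card', 'email'} (set order may vary)
```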

2) DRM/IRM

Data in the cloud is out of the physical control of the organization, so compensating controls are needed to protect it during its lifecycle, especially during the use and share phases. DRM, aka IRM, mentioned in Sect. 2.2, is an ideal mechanism [23, 27]. DRM/IRM usually has the following advantages: (1) persistent protection, which follows the information it protects regardless of where it is located; (2) dynamic policy control, which allows data owners to modify access control lists (ACLs) and permissions for the protected data under their control; (3) remote rights revocation, whereby the data owner can revoke permissions at any time; and (4) continuous auditing, which allows comprehensive monitoring of the access history.
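As an illustration of advantages (2) to (4), the minimal sketch below models owner-driven policy changes, revocation, and audit logging; the class and its methods are hypothetical and do not represent the API of any particular DRM/IRM product.

```python
# A hypothetical, minimal DRM/IRM policy object; not any vendor's actual API.
from dataclasses import dataclass, field

@dataclass
class ProtectedDocument:
    acl: dict = field(default_factory=dict)        # user -> permission ("read"/"edit")
    audit_log: list = field(default_factory=list)  # (user, allowed) access records

    def grant(self, user: str, permission: str) -> None:
        self.acl[user] = permission                # dynamic policy control by the owner

    def revoke(self, user: str) -> None:
        self.acl.pop(user, None)                   # remote rights revocation

    def access(self, user: str) -> bool:
        allowed = user in self.acl                 # policy travels with the document
        self.audit_log.append((user, allowed))     # continuous auditing
        return allowed

doc = ProtectedDocument()
doc.grant("alice", "read")
print(doc.access("alice"))  # True
doc.revoke("alice")
print(doc.access("alice"))  # False: denied even after the data has left the organization
```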

Despite these advantages, leveraging DRM/IRM in the cloud still faces challenges. One is replication restrictions: DRM/IRM governs permissions for replication and sharing, but administrative processes in the cloud environment often require creating, shutting down, moving, and backing up VM instances, which undoubtedly conflicts with DRM/IRM policies. The other is jurisdictional conflicts: the blurred physical boundaries of cloud computing lead to the transborder flow, or even loss of control, of large amounts of data, which triggers regulatory restrictions in different jurisdictions.

3) Continuous Monitoring

Automated or continuous monitoring and reporting is an important mechanism for cloud computing to achieve its self-service capability. The monitoring targets mainly include: (1) the physical environment, covering the temperature, humidity, and so forth of the data center; (2) the host level, including performance and event tracing of the operating system, middleware, and applications; and (3) the network level, which covers various network components, not only hardware and software but also cabling, Software Defined Networking (SDN), and the control plane.

Continuous monitoring can improve performance and enhance security; however, it can also raise privacy and ethical issues. For example, a CSP may collect operating system event logs through an agent installed in a VM instance in order to obtain the VM guest status and implement anomaly detection. Although system event logs contain no direct identifiers, i.e. PII, a user's behavioral characteristics can still be inferred by reasoning over system events. Such audit data can then be used for precision marketing and, in the worst case, obtained by cyber threat actors to understand user behavior and prepare suitable attack vectors.

3.3 Legal and Regulatory Compliance (LRC)

Since the essence of cloud computing is to drive business improvement, many of its features, such as decentralization and multi-tenancy, make it difficult to comply with existing data privacy protection and other laws and regulations.

1) eDiscovery

eDiscovery refers to the process of identifying and obtaining electronic evidence for either prosecutorial or litigation purposes. Since cloud computing is often multi-tenant, it is more difficult to find data owned by one CSC without intruding on data from other CSCs that may reside on the same storage volume, drive, or physical machine. In addition, from a judicial point of view, all evidence needs to be tracked and monitored from the time it is recognized as evidence and acquired for that purpose, which is called the chain of custody. However, cloud platforms dynamically allocate and recycle resources for other tenants in the same storage location, which conflicts with this judicial principle. Thus, when creating security and privacy policies for maintaining a chain of custody or conducting activities that require the preservation and monitoring of evidence, we need to comply with the relevant regulations.

2) Diverse Jurisdictions

A great deal of the difficulty in complying with laws and regulations in cloud computing stems from its design: resources and data are often dispersed across county, state, and even international borders. As mentioned earlier, the transborder transfer of data is the biggest obstacle to legal and regulatory compliance in the cloud. Compliance governance must take all applicable laws and regulations into account and operate with a clear understanding of the legal risks and liabilities in the cloud.

4 Empirical Evaluation Model

This section details our proposed evaluation model for cloud data protection. By analyzing the aforementioned factors that affect cloud data security and privacy, we present a novel algorithm to quantitatively calculate the protection score of cloud data in a specific organization and to indicate its overall protection level.

4.1 Important Factors

We consider four factors to be important when assessing the protection level of cloud data in an organization.

  • Intra-phase. As mentioned in Sect. 2.2, a variety of security and privacy controls are implemented at each phase of the cloud data lifecycle. As Table 1 shows, even within one phase there may be multiple data states, which means that more controls need to be implemented so that data in every state is protected. Consequently, the more states the data takes within a phase, the more assessment is required. Furthermore, data in transit is generally more vulnerable to compromise than data at rest or in use, because data in transit on public carriers is beyond the confines of the cloud itself. Based on this intuition, our evaluation model assigns a higher weight to phases with multiple states or phases containing the in-transit state.

  • Inter-phase. Over time, the more phases cloud data passes through in its lifecycle, the more opportunities it has to be processed, used, and shared. In other words, data in later phases is more critical than in earlier ones, so more controls need to be implemented in later phases. As an extreme example, if the data is not properly destroyed during the destruction phase, such as by using an insecure method (e.g., simply using the OS delete or rm command), or if the encryption key is leaked despite using the recommended crypto-shredding method, the organization ultimately loses control of data that it believes has been destroyed but can in fact still be accessed. In short, the later phases deserve more attention in the evaluation.

  • Lifecycle operations. Cloud security operations are the guarantee that cloud data risks are managed and mitigated within an acceptable range. Although many aspects of security operations are handled by CSPs, and are therefore invisible and imperceptible to CSCs, cloud security operations include many global security controls spanning the whole cloud data lifecycle, such as BC/DR and the continuous monitoring mentioned in Sect. 3.2. Our insight is that cloud security operations should be considered security and privacy measures applied to every phase of the cloud data lifecycle, and should therefore be given global priority and a proper weight in the evaluation model.

  • Compliance. Compliance requirements are an important aspect of an Information Security Management System (ISMS) [2, 5], and there is no doubt that the use of cloud computing presents many challenges in identifying and meeting compliance requirements in specific jurisdictions. In typical cloud scenarios, since resources are allocated dynamically, CSCs do not know the exact physical location of their data. In fact, depending on the level of automation and data center design, even CSPs may not know where the data resides at all times while managing VM images and other data. Similar to security operations, compliance requirements should be fully evaluated in the assessment model as a global factor.

Hence, to conduct a comprehensive evaluation of cloud data protection, we need to first calculate the scores for intra-phase, inter-phase, operations and compliance, respectively.

4.2 Quantitative Analysis

Table 3. Intra-phase weight rating scale

First, to calculate the intra-phase score, we define weights based on the data states present within a phase, as shown in Table 3. We prioritize the three data states, in transit, at rest, and in use, according to their likelihood of risk discussed in Sect. 4.1, and construct a binary truth table. The last column of Table 3 shows the weight values assigned to phases with different combinations of data states, with the binary representation given in parentheses. We then use the intra-phase weight to compute its protection score, which is defined as:

$$\begin{aligned} IntraS = \frac{1}{6} \times (\sum _{ i =1}^{6} w_{i} \times \frac{1}{max( r_{i} )}) \end{aligned}$$
(1)

where w\(_i\) is the weight of phase i of the cloud data lifecycle defined in Table 3, and r\(_i\) denotes the severity levels of the possible risks in phase i, which can be found in industry standards or best practices. Since the possible risks within a phase with diverse data states are usually technical issues, we identify the risks and assign their severity levels based on the Common Attack Pattern Enumeration and Classification (CAPEC) list defined by US-CERT and DHS in collaboration with MITRE [35]. For quantitative calculation, we map each severity level to a numerical value based on the conversion table shown in Table 4, which is included in the Common Vulnerability Scoring System (CVSS) [19]. We then select the highest value among the identified risks in each phase and obtain the intra-phase score.
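The sketch below evaluates Eq. (1); the per-phase weights and maximum risk severities are placeholders standing in for the actual entries of Tables 3 and 4, which an assessor would substitute.

```python
# A sketch of Eq. (1); weights and severities are placeholders, not Table 3/4 values.
PHASES = ["create", "store", "use", "share", "archive", "destroy"]

# w_i: intra-phase weights derived from which data states occur in each phase (Table 3).
w = {"create": 5, "store": 3, "use": 7, "share": 6, "archive": 3, "destroy": 1}

# max(r_i): highest CVSS-style numeric severity of the risks identified per phase (Table 4).
max_r = {"create": 5.0, "store": 7.0, "use": 9.0, "share": 9.0, "archive": 4.0, "destroy": 6.0}

def intra_score(weights: dict, max_risk: dict) -> float:
    return sum(weights[p] / max_risk[p] for p in PHASES) / 6.0

print(round(intra_score(w, max_r), 3))
```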

Table 4. Severity levels of possible risks rating scale

Next, we define the formula for calculating the score of protection level of inter-phase.

$$\begin{aligned} InterS = \frac{1}{6} \times (\sum _{ i =1}^{6} \frac{1}{ (max(r_{i}))^{{w}'_{i}} }) \end{aligned}$$
(2)

where w’\(_i\) = (10+i)/10 for phase i of the cloud data lifecycle, which assigns slightly higher weights to later phases as mentioned in Sect. 4.1. Similar to the intra-phase formula, r\(_i\) denotes the possible risks in phase i across the whole lifecycle of cloud data. We use CAPEC and the Cloud Controls Matrix (CCM) published by the Cloud Security Alliance (CSA) [9] to identify additional possible risks. Note that our evaluation model supports customization: the risk identification model and the severity levels can be replaced with expert-specified alternatives. The risk with the highest severity value represents the most critical risk in each phase, and the sum is taken as the inter-phase score.

Then, to calculate lifecycle operations score, we define the formula as:

$$\begin{aligned} OpS = \frac{1}{ max(r) } \end{aligned}$$
(3)

where r denotes the possible operations risks, e.g., data breaches. Similarly, the risk with the highest severity value represents the protection level of operations.

Similar to OpS, the compliance score is defined as:

$$\begin{aligned} ComS = \frac{1}{ max(r) } \end{aligned}$$
(4)

where r denotes the possible compliance risks based on the organization's geographic location, jurisdiction, and industry. For example, a bank located in the EU needs to comply with the GDPR [36] and PCI DSS [11].

Last, we give the overall formula to compute the protection level score of cloud data in an organization, which is defined as:

$$\begin{aligned} S = \alpha \times IntraS + \beta \times InterS + \gamma \times OpS + \delta \times ComS \end{aligned}$$
(5)

where the parameters \(\alpha , \beta , \gamma \), and \(\delta \) can be configured based on expert knowledge and the specific application scenario. Based on our empirical experience, the default values of these parameters are 0.2, 0.2, 0.3, and 0.3, respectively.
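Putting Eqs. (2) to (5) together, the sketch below computes the overall score from placeholder severity values; the IntraS input and all risk numbers are illustrative, and in practice they come from the CAPEC/CCM-based risk identification described above.

```python
# A sketch assembling Eqs. (2)-(5); all severity values below are placeholders.
PHASES = ["create", "store", "use", "share", "archive", "destroy"]

def inter_score(max_risk: dict) -> float:
    # Eq. (2): w'_i = (10 + i) / 10 gives later lifecycle phases a slightly higher weight.
    return sum(1.0 / (max_risk[p] ** ((10 + i) / 10))
               for i, p in enumerate(PHASES, start=1)) / 6.0

def op_score(max_operations_risk: float) -> float:
    return 1.0 / max_operations_risk               # Eq. (3)

def com_score(max_compliance_risk: float) -> float:
    return 1.0 / max_compliance_risk               # Eq. (4)

def overall_score(intra: float, inter: float, ops: float, com: float,
                  alpha: float = 0.2, beta: float = 0.2,
                  gamma: float = 0.3, delta: float = 0.3) -> float:
    return alpha * intra + beta * inter + gamma * ops + delta * com   # Eq. (5)

max_r = {"create": 5.0, "store": 7.0, "use": 9.0, "share": 9.0, "archive": 4.0, "destroy": 6.0}
S = overall_score(intra=0.6, inter=inter_score(max_r), ops=op_score(7.0), com=com_score(5.0))
print(round(S, 2))
```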

4.3 Case Study

Table 5. Case study scores.

We leverage the protection score presented above to assess a financial enterprise that wishes to remain anonymous, and the result is in accordance with a parallel manual evaluation by experts. Table 5 shows the four sub-scores, which reflect the four evaluation factors mentioned above. For the intra-phase score, we obtained the highest severity value in the SHARE stage, because the target of evaluation has weak access control for user PII. For the inter-phase score, we identified the highest severity value when data moves from the ARCHIVE stage to the DESTROY stage, due to the lack of an approved, unified mechanism for destroying data no longer subject to retention. Using the default parameters mentioned in Sect. 4.2, the overall protection score of the target of evaluation's cloud data is 0.46. Mapping this score to the numeric intervals listed in Table 4 shows that the overall protection level is medium, which is consistent with the manual qualitative assessment by another team.

5 Related Work

5.1 Cloud Security Assessment

Traditional information system risk assessment mechanisms are still effective for cloud computing environments. However, as a popular computing architecture, the cloud environment has several aspects that distinguish its risk assessment from that of other IT systems. First, the cloud environment involves more entities, including CSPs, CSCs, cloud users, cloud auditors, cloud carriers, and so forth; these stakeholders bring more challenges to cloud security assessment. Second, the technology stack of the cloud computing architecture is more complex, and the evaluation targets include components owned and used by multiple parties, such as the physical environment, virtualization, and applications. In addition, compliance is also an important aspect of cloud risk assessment [9].

5.2 Data Security and Privacy

Data security, privacy, and ethics have come to be widely considered in the security community and among the legal profession. Whether data incorporates security and privacy controls throughout its lifecycle is a critical observation for data security assessment. Data classification and the access control matrix are important data risk assessment tools as well as data security controls. Moreover, continuous monitoring is also used to assess whether the exchange of data violates the organization's data security policy [12,13,14, 26, 43].

6 Conclusion

In this paper, we make a comprehensive review of each aspect of cloud data protection, including security, privacy, and ethical considerations. To evaluate an organization's cloud data protection level, we propose an empirical model that calculates the protection score based on four important factors. The novel algorithm we present improves the degree of automation of the evaluation and the credibility of its results. Frankly speaking, however, our evaluation model is still semi-automated and needs experts to identify risks and conduct other manual activities; addressing this is our future research goal.