Set-Based Adaptive Distributed Differential Evolution for Anonymity-Driven Database Fragmentation

By breaking sensitive associations between attributes, database fragmentation can protect the privacy of outsourced data storage. Database fragmentation algorithms need prior knowledge of sensitive associations in the tackled database and set it as the optimization objective. Thus, the effectiveness of these algorithms is limited by prior knowledge. Inspired by the anonymity degree measurement in anonymity techniques such as k-anonymity, an anonymity-driven database fragmentation problem is defined in this paper. For this problem, a set-based adaptive distributed differential evolution (S-ADDE) algorithm is proposed. S-ADDE adopts an island model to maintain population diversity. Two set-based operators, i.e., set-based mutation and set-based crossover, are designed in which the continuous domain in the traditional differential evolution is transferred to the discrete domain in the anonymity-driven database fragmentation problem. Moreover, in the set-based mutation operator, each individual’s mutation strategy is adaptively selected according to the performance. The experimental results demonstrate that the proposed S-ADDE is significantly better than the compared approaches. The effectiveness of the proposed operators is verified.


Introduction
Outsourced data storage has shown its advantages in scalability and costs [13,14,33,35]. Data security and privacy are still concerns when the organizations adopt the outsourced data storage [20,26,27,30,31]. In health industry, such concerns are significant [5,11,21,25]. Although database encryption [12,28] can protect the database privacy, encryption and decryption reduce the query efficiency accordingly [17,24,32], which is also crucial in outsourced data storage. By dividing attributes of sensitive associations, separate data models can protect the database privacy while not requiring time-consuming data transformation.
Several algorithms have been proposed for privacy protection in database fragmentation [15,34]. In [2], encrypted database fragmentation was proposed to break the sensitive associations between attributes. Authors in [3] proved that the database fragmentation problem is NP-hard and proposed two heuristic strategies to achieve the optimal number of fragments. Then, a graph search algorithm [29] was proposed based on the confidentiality constraints. These algorithms are designed to address the database fragmentation problem with predefined privacy constraints. However, when it comes to the database fragmentation problem without prior knowledge, these algorithms become inapplicable. On the other side, anonymity techniques [16,23] are also effective in database privacy protection. In these techniques, the performance is evaluated by the anonymity degree. A database with a higher anonymity degree is more likely to protect its data privacy. Unlike the privacy constraints set in database fragmentation algorithms, the anonymity degree can be calculated in all kinds of databases and does not require any prior knowledge. If we set the anonymity degree as the optimization objective, the database fragmentation algorithm can be used in all kinds of databases, including those lacking prior knowledge. Based on the above consideration, an anonymity-driven database fragmentation problem is defined.
For NP-hard optimization problems, various differential evolution (DE) algorithms have been proposed [7,8,18,19]. They have shown the advantages in efficiency and reliability in optimization problems, including seismic inversion [6], microwave circuit design [36], and protein structure prediction [38]. In DE, different mutation strategies can address search problems of different properties. A proper design of mutation and crossover operators helps achieve the tradeoff between exploration and exploitation. Considering the effectiveness of previous DE algorithms in the applications, it is worthy of applying the DE algorithm to the anonymitydriven database fragmentation problem.
In this paper, a set-based adaptive distributed differential evolution (S-ADDE) algorithm is proposed to address the anonymity-driven database fragmentation problem. The individual in S-ADDE represents the database fragmentation solution, and the anonymity degree of each solution is set as the fitness value of the individual. The update of the individuals in S-ADDE reflects the increase in anonymity degree in database fragmentation. Moreover, the contributions of this paper are listed as follows: 1. In order to maintain population diversity, we utilize an island model including four sub-populations; 2. Two set-based operators are proposed to transfer the continuous domain in traditional DE to discrete domain in the database fragmentation problem; 3. In the set-based mutation operator, each individual's mutation strategy is adaptively selected according to the evolving performance.
The remainder of this paper is organized as follows. Section 2 outlines the related work of database fragmentation for privacy protection, anonymity techniques, and recent DE applications. Afterward, the problem definition of the anonymity-driven database fragmentation problem is given in detail. In Sect. 4, a brief description of DE operators is given. Subsequently, we illustrate the proposed S-ADDE algorithm. In Sect. 6, extensive experiments and simulations are carried out, and the results are analyzed. Finally, in Sect. 7, we summarize this paper.

Related Work
Database fragmentation for data privacy protection was firstly proposed in [1]. In this work, all the database attributes are divided into two groups, which limits the performance in complex problems. Encrypted database fragmentation was introduced in [2], which has shown its advantage in breaking the sensitive association between attributes.
Authors in [3] proved that the database fragmentation problem is NP-hard. Furthermore, two heuristic strategies were proposed to achieve the optimal number of fragments. Based on the confidentiality constraints in database fragmentation, a graph-based algorithm was proposed in [29], in which the nodes in the graph represent the database fragmentation solutions. In [4], the authors investigated the effectiveness of loose association in database fragmentation. The evolutionary algorithm was also utilized in database fragmentation. In [10], a distributed memetic algorithm was proposed to tackle the outsourced database fragmentation problem, in which the database privacy is satisfied, and the database utility is optimized. These algorithms are designed based on predefined privacy constraints. When facing a database fragmentation problem laking of prior knowledge, these approaches are inapplicable.
On the other side, various anonymity techniques involving k-anonymity [23], l-diversity [16] were proposed for protecting the database privacy. The performance of the anonymity techniques is measured by the anonymity degree. One advantage of anonymity degree measurement is it does not require prior knowledge of the tackled database. Based on this advantage, the anonymity degree measurement can also be utilized in the database fragmentation problem.
In the last decades, DE algorithms have been widely utilized in actual applications and have shown their efficiency and reliability advantages. In [38], an underestimation-assisted global-local cooperative DE was proposed to enhance the effectiveness and the efficiency in optimization. The proposed algorithm was utilized in protein structure prediction and outperformed the competitors. Authors in [37] proposed a self-adaptive DE algorithm for addressing the batch-processing machine scheduling problem, in which the mutation operators and control parameter values are adaptively adjusted. In [22], a bi-objective elite DE was designed to optimize the multivalued logic networks. Two objective functions were used to simultaneously evaluate the fitness of the individuals in DE. In these applications, DE was redesigned according to the difficulty in the corresponding optimization problems.

Problem Definition
For a given relation schema R, a fragmentation F is legal if: 1. ∀r ∈ R ∶ ∃F ∈ F such that r ∈ F (a fragmentation cover all attributes) 2. ∀F i , F j ∈ F, i ≠ j ∶ F i ∩ F j = ∅ (fragments in the fragmentation do not have attributes in common) For anonymity-driven database fragmentation, the objective is to identify a solution that can achieve the highest 1 3 anonymity degree for relation schema R. Anonymity degree of fragmentation is calculated as: where anonymity(F) is the anonymity degree of fragmentation F . Anonymity degree of fragmentation is decided by its fragment of lowest anonymity degree. Suppose fragmentation F is the optimal solution for R, it should meet the following conditions: which means fragmentation F is of higher anonymity degree than all the other legal solution that can satisfy the first two conditions.

Differential Evolution (DE)
DE process includes three operators, i.e., mutation, crossover, and selection. An elementary description of these three operators is given as follows.

Mutation
The mutation operator utilizes the difference between individuals in the population to construct the mutant individuals. where g best indicates the best individual at generation g; r1, r2, r3, r4, and r5 are indexes of five random individuals; F is the differential factor.

Crossover
In this process, the mutant individual g i exchanges evolutionary information with the current individual g i and generates a trial individual g i , which is formulated as: where rand(0, 1) is a random float number between 0 and 1; j rand represents a random integer to guarantee at least one bit of g i comes from g i ; Cr represents the crossover rate.

Selection
In this stage, the fitness values of the current individual and the trial individual are compared. The individual with higher fitness value is selected and kept. For a minimization problem f(x), the selection process is formulated as follows: where g+1 i is the target individual at the next generation. When utilizing DE to solve the database fragmentation problem, DE's individuals can represent solutions of database fragmentation. The fitness values of the individuals represent the anonymity degrees of database fragmentation solutions. Thus, the improvement of fitness values in DE can help achieve better database fragmentation.

Set-Based Adaptive Distributed Differential Evolution (S-ADDE)
In this section, the proposed S-ADDE algorithm is described in detail. The representation manner of the S-ADDE algorithm is first introduced. Then, the island model of S-ADDE is introduced. Afterward, a set-based mutation operator, an adaptive strategy of mutation strategy selection, and a set-based crossover operator are proposed. Finally, to illustrate the overall S-ADDE process, the pseudo-code of S-ADDE is given and explained.

Representation
In Fig. 1, a sample database is given, which contains nine attributes and six records. As shown in the figure, the database is divided into three fragments. These three fragments make up a fragmentation solution shown at the bottom of the figure.
Each individual in the proposed S-ADDE algorithm represents a database fragmentation solution. Accordingly, each bit in the individual indicates one attribute in the database, and its value indicates which fragment the corresponding attribute is chosen for allocation.

Island Model
DE algorithm with the distributed framework has shown its advantages in optimization performance and speed. One classic distributed framework of the DE algorithm is the island model. The DE algorithm population is divided into several sub-populations in the island model, and each subpopulation evolves independently. According to the communication topology, sub-populations share their elite individuals with a given interval, which is referred to as the migration interval. Once one sub-population receives the migrated elite individuals, individuals in the current generation are randomly selected and replaced.
In the proposed S-ADDE algorithm, an island model with a ring communication topology is utilized. An example of the island model is given in Fig. 2. As shown in the example, each big circle indicates a sub-population. In the big circles, small triangles and circles represent the best individuals and  By dividing the entire population of the DE algorithm into several sub-populations with independent evolution, the island model can help the S-ADDE algorithm maintain the population diversity. By migrating elite individuals in the sub-populations, the island model can enhance the S-ADDE algorithm's population quality. If the migration operator is appropriately executed, the S-ADDE algorithm can achieve the trade-off between exploration and exploitation. Moreover, since each sub-population evolves independently, the island model can be directly implemented in a distributed manner, which is crucial for speedup in evolution.

Set-Based Mutation
As mentioned above, the traditional DE algorithm's mutation strategies are designed to optimize the continuous optimization problems. In our tackled database fragmentation problem, the fragmentation solutions are of the discrete domain. One challenge of mutation strategy design is how to transfer the continuous domain in traditional mutation strategies to the discrete domain in the database fragmentation problem. In the S-ADDE algorithm, a set-based mutation operator is proposed. To be specific, the fragment solution in the individual of S-ADDE is regarded as a set. Thus, the calculation between two individuals is transferred to the set calculation between two database fragmentation solutions. In DE mutation operators, two kinds of calculation rules are involved. The first part is to calculate the difference between two individuals. The second part is to calculate the sum of two individuals. When calculating the difference and the sum for each bit, if the values of two individuals are the same, the same value is kept. If the values of two individuals are different, both two values are kept as a set. With the help of the above calculation rules, all the existing mutation strategies can be utilized in discrete optimization problems.
As mentioned above, when tackling discrete optimization problems, traditional mutation strategies cannot be directly utilized. In the proposed set-based mutation operator, the calculation between two individuals is transferred to the set calculation between the elements in the individuals. When utilizing these two calculation rules in the traditional mutation strategies, all the mutation strategies can be utilized. For instance, in DE/rand/1 mutation strategy, the difference between two randomly selected individuals is firstly calculated and then added to the third individual. Accordingly, the elements only contained by the first random individual are extracted and added to the fragments in the third random individual.

Adaptive Mutation Strategy Selection
In the set-based mutation operator, the calculation between individuals in the mutation strategy is transferred to the set-based calculation between fragmentation solutions. The set-based calculation can fit all the mutation strategies and achieve its effect. Every mutation strategy has a unique effect on optimization. For example, DE/rand/1 strategy can enhance the population strategy and outperform the multimodal search problems. In contrast, the DE/best/1 strategy can enhance the DE algorithm's exploitation ability and achieve high performance in single-modal search problems. An adaptive strategy for selecting a proper mutation strategy during the evolution is designed based on this background. All the chosen mutation strategies are inserted into a strategy pool, and the mutation strategy of the each individual is adaptively selected.
At the beginning of the evolution, each individual chooses one mutation strategy from the strategy pool. After executing the chosen mutation strategy, if the generated individual is kept in the population, which indicates higher competitiveness, the current mutation strategy will be utilized in the next generation. Otherwise, a mutation strategy is randomly chosen from the strategy pool, and the current mutation strategy is replaced.
If a mutation strategy can fit a search problem well, it is more likely to generate better individuals and of higher probability to be utilized. On the contrary, a mutation strategy that cannot achieve ideal performance is less likely to be utilized. In various search problems, the probability of executing every mutation strategy is adaptively adjusted, which makes the proposed S-ADDE outperform.

Set-Based Crossover
After the set-based mutation operator, the chosen individuals' evolutionary information is extracted and used to construct a mutant individual. In the mutant individual, some bits may contain more than one element. According to the above problem definition, in a legal solution for database fragmentation, each element must be involved in one fragment and only appear once, which indicates a fragmentation covering all attributes, and the fragments do not have attributes in common. To make the mutant individual from the set-based mutation operator legal, if one bit contains two or more elements, one element is chosen by random. Afterward, the legal mutant individual i exchanges the evolutionary information with the current individual i . For each element, with the crossover rate, it is assigned in the same fragment as individual i .
With the proposed set-based crossover operator's help, fragmentation information from the mutant individual and the current individual is exchanged. Since the new individual's redundant elements are removed, for each element, it is only contained by one fragment and will not appear more than once in the generated individual. Randomly, an individual containing fragmentation information from two individuals is generated. If the generated individual is better than the current individual, it will be kept, and the competitiveness of the entire population is improved.

Overall Process
The pseudo-code of S-ADDE is given in Algorithm 1. As shown in the pseudo-code, a master-slave model is utilized to implement the S-ADDE algorithm. At the master node, the entire population is divided into N sub-populations and sent to the corresponding N slave nodes. With the predefined migration interval MI, the master node receives the migrated individuals from the slave nodes and sends these elite individuals to the corresponding slave nodes. The the set-based mutation operator is firstly carried out, and the corresponding mutant individual is generated. Then, the mutant individual is transferred to be legal in the set-based crossover operator and exchanges the current individual's evolutionary information. If the newly generated individual is more competitive, it is kept in the population, and the current mutation strategy is kept. Otherwise, the current individual in the population is kept, and a mutation strategy is randomly selected from the mutation strategy pool. With the given migration interval MI, the slave node sends its best individual to the master node and receives one elite individual from the master node. The migrated elite individual replaces one randomly chosen individual in the current population.

Experimental Setup
In the experiments, 16 test cases are utilized to verify the effectiveness of the proposed S-ADDE algorithm in database fragmentation. These test cases are generated based on the public datasets of New York State Department of Health. 1 Table 1 outlines the properties of these test cases, which include numbers of records NR, numbers of sample records NSR, numbers of attributes NA, and numbers of sites NS.
In the proposed S-ADDE algorithm, the number of subpopulations in S-ADDE is set as 4; the sub-population size  migration process is executed until the terminal condition is satisfied. Finally, the fragmentation solutions of the subpopulations are collected, and the best fragmentation solution is outputted. At the slave node, each sub-population evolves independently. At the beginning of the evolution, a mutation strategy is randomly selected from the mutation strategy pool for each individual. For each individual in the sub-population, SPS is set as 10; the migration interval MI is set as 10; the crossover rate Cr is set as 0.5. For all the algorithms, the maximum fitness evaluation number is set as NS × 10 3 .
The island model of S-ADDE is implemented by the Message Passing Interface (MPI). Each sub-population is assigned to a computation core in the CPU. The communication between sub-populations is implemented by the message passing between CPU cores.

Comparisons with Competitive Approaches
To verify the effectiveness of the proposed S-ADDE algorithm on test cases, two competitive approaches, i.e., heuristic algorithm [3] and differential evolution [19], are compared. The specific description of these two competitive approaches are listed as follows: 1. HA [3]: This is a state-of-the-art heuristic algorithm for database fragmentation problem, in which two heuristic strategies are designed to achieve the optimal fragmentation. 2. DE [19]: This approach acts as a baseline algorithm. The differences between DE and S-ADDE in performance show the effectiveness of the proposed operators. 3. S-DDE [9]: In this algorithm, the database fragmentation problem is optimized by set-based mutation and crossover operators. Different from the proposed S-ADDE, its mutation strategy is not adaptively selected.
Furthermore, on each test case, the proposed S-ADDE and the compared approaches are performed in 25 independent runs. The average and standard deviation values obtained by all the approaches are calculated and listed in Table 2. The best results are labeled in boldface. S-ADDE algorithm can outperform the other approaches on all the test cases. Due to the exploration ability, S-ADDE can achieve a better performance than HA. In HA, the search direction is led by the predefined heuristic strategy and its search direction Firstly, the island model maintains the population diversity, which is helpful in complex test cases. Secondly, the individuals' evolutionary information is effectively extracted by the set-based operators and used to reconstruct offsprings. When comparing with S-DDE, the advantage of the adaptive mutation strategy selection in S-ADDE is verified. With the help of the proposed adaptive strategy, S-ADDE can achieve a better balance between the exploratory and exploitative search. Therefore, S-ADDE can outperform S-DDE on 12 out of 16 test cases.
The obtained convergence curves on four typical test cases are plotted in Fig. 3. Three lines indicate the convergence curves obtained by HA, DE, and S-ADDE, respectively. For each point, the value on the horizontal axis represents the number of fitness evaluations, while the vertical axis represents the average anonymity degrees obtained in 25 independent runs. In the beginning, all three algorithms converge rapidly. HA quickly gets trapped by the local optimum and stagnates. Thanks to the exploration ability of DE and S-ADDE, they can continuously improve the anonymity degree during the search. The differences between the green lines of S-ADDE and the red lines of DE verify the effectiveness of the island model and the proposed set-based operators in S-ADDE. Moreover, the proposed adaptive strategy can help accelerate the convergence speed of the proposed S-ADDE. Compared with S-DDE, we can find that S-ADDE can achieve higher values of anonymity degree during the entire optimization process. Overall, S-ADDE can achieve the highest convergence speed due to the trade-off between exploration and exploitation.
Based on the definition of the database fragmentation problem, the time complexity of calculating the anonymity degree for a given fragmentation is O(NA × NR × log NR) . Thus, the time complexity of the proposed S-ADDE algor it hm is O(NA × NR × log NR × NS) . According to the original papers, the time complexity of HA is O(NA 2 × NR × log NR) . The time complexity of DE and S-DDE is O(NA × NR × log NR × NS) , which is the same as S-ADDE. Overall, in terms of the time complexity, S-ADDE and all three compared algorithms do not show significant differences. Moreover, since S-DDE and S-ADDE are implemented based on the distributed framework, the parallel computation can help reduce the running time.

Effect of Adaptive Mutation Strategy Selection
In S-ADDE, the mutation strategy is adaptively selected. An experiment is carried out to verify its effect. Besides the proposed S-ADDE algorithms, three variants of S-ADDE adopting different mutation strategies are implemented and listed as follows: 1. S-ADDE-rand This variant adopts DE/rand/1 as the mutation strategy. The mutation strategy is not changed during the entire evolution. To be noted, except for the adaptive mutation strategy selection part, all the compared variants share the same distributed island model and operators as the original S-ADDE. In Table 3, the average and standard deviation obtained by variants and the original S-ADDE are listed. The best results are labeled in boldface. In total, S-ADDE can outperform on 13 test cases; S-ADDE-current-to-best can outperform on 6 test cases; S-ADDE-rand and S-ADDEbest cannot outperform on any test case. Compared with S-ADDE-current-to-best, although it can outperform 6 test cases, the proposed S-ADDE can achieve the same results on 3 test cases. Compared with the other two variants, i.e., S-ADDE-rand and S-ADDE-best, the proposed S-ADDE shows its advantages in all the test cases. Although these two variants do not outperform in the comparison, they are effective during the cooperation with other mutation strategies. This point can be verified by the performance of S-ADDE, which utilizes the adaptive mutation strategy selection.

Effect of S-ADDE Results on Original Datasets
In this section, the anonymity degrees of original datasets without database fragmentation and with database fragmentation are compared. As shown in Table 4, AD indicates the anonymity degree of each dataset. Values min(AD), avg(AD), and max(AD) represent the minimal, average, and maximum values of anonymity degree obtained by the fragments in S-ADDE.
According to Table 4, it is clear that S-ADDE can afford significant enhancement to original datasets in anonymity. Without S-ADDE, the anonymity degree of original datasets is around 1, which means each record in the original datasets can be directly identified according to its characteristics. With the help of the S-ADDE database fragmentation result, the anonymity degree of the original dataset in each fragment increases. Focusing on the column of min(AD), which indicates the minimal anonymity degree obtained by all the fragments, it is obvious that the privacy

Speedup Ratio
In the distributed algorithm, the speedup ratio is an important indicator. It can reflect the computation efficiency of the distributed algorithm. The speedup ratio is calculated by dividing the running time taken by the distributed algorithm into the sequential algorithm's running time. A distributed algorithm with a higher speedup ratio can achieve higher distributed computation efficiency, which is crucial in maintaining the algorithm's scalability.
In the proposed S-ADDE algorithm, each sub-population is allocated to a single computation core, and each sub-population evolves independently. Thus, the number of sub-populations in S-ADDE directly reflects its parallel granularity. S-ADDE algorithms' running time with different numbers of sub-populations (1, 2, 4, 6) is measured. The S-ADDE algorithm with a single sub-population is regarded as the sequential algorithm, and the S-ADDE algorithms with multiple sub-populations are regarded as the distributed algorithms.
In Fig. 4, the speedup ratios of S-ADDE on 16 test cases are plotted. The speedup ratios significantly increase with the parallel granularity of S-ADDE increases from two to six. The speedup ratio curves in different test cases vary. This is because different test cases are of different complexity and need different evaluation time. In general, the communication time of S-ADDE on different test cases does not have a significant difference. Thus, a test case of higher evaluation time, such as T 15 and T 16 , can help achieve speedup ratios. When adopted in actual optimization problems, which contains higher complexity, the proposed S-ADDE algorithm can further show its scalability and speed advantages.

Conclusion
An anonymity-driven database fragmentation problem has been defined in this paper. To address this problem, we have proposed the S-ADDE algorithm. S-ADDE algorithm utilizes an island model to improve the population diversity, which is crucial in complex search problems. Moreover, we have proposed two set-based operators, i.e., set-based mutation operator with adaptive mutation strategy selection and set-based crossover operator. According to the analysis of the experimental results, the proposed S-ADDE algorithm is significantly better than the compared algorithms. The computation efficiency of S-ADDE has been investigated and the effectiveness of the proposed operators has been verified.
In this paper, the privacy issue (i.e., anonymity degree) of the database fragmentation is optimized. Furthermore, the utility issue (e.g., communication cost) of the database fragmentation can be further investigated and optimized in future work. In addition, considering the effectiveness of the proposed set-based operators and the adaptive selection strategy, we can further apply them in the other discrete optimization problems, such as the database allocation problem.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.