1 Introduction

In the present era, data plays a critical role in the daily lives of individuals [1,2,3,4,5,6]. The dissemination and utilization of data [7,8,9,10,11,12,13] have created enormous opportunities for decision-making and knowledge exploration [5, 14,15,16,17,18,19,20,21,22,23]. For instance, in 2006, Netflix released a dataset comprising 100 million movie ratings to enhance its recommendation system’s performance [24]. However, despite the significant advantages of data publication, concerns about data privacy preservation have arisen [25,26,27,28,29,30,31]. Consequently, privacy-preserving data publishing (PPDP) has emerged as a critical area of research, which aims to create an anonymous dataset that safeguards privacy while maintaining optimal data utility levels. This objective can be achieved through various privacy-preserving techniques such as data anonymization, generalization, and perturbation.

When it comes to PPDP, two main categories of approaches exist: decreasing the precision of the original dataset, and data perturbation [16]. In the first category, a well-known approach was introduced in [32] that uses a binary search on the generalization lattice to identify the anonymization solution. Kohlmayer et al. [33] presented a comprehensive framework for optimal anonymization, which enabled the Flash algorithm to find the optimal anonymization solution by searching the path in the lattice. An algorithm proposed in [34] optimized the anonymization solution in an identical generalization hierarchy, which is useful for protecting data privacy in the general Internet of Things (IoT) environment. However, existing works mostly focus on single anonymization operations (such as attribute generalization or record suppression), which may not be effective from the perspective of information release. Therefore, it is worth considering combining multiple anonymization operations when optimizing the anonymization solution. Moreover, existing works mostly adopt graph search-based strategies to optimize the anonymization solution, but these approaches may lose their effectiveness when the search space of the PPDP problem becomes complex. Ge et al. [35] formulated the multi-objective data publishing problem and proposed a distributed cooperative coevolution evolutionary framework to achieve efficient optimization. In the second category, differential privacy represents one of the typical approaches that ensure no significant difference in query results when inserting one record [36, 37]. These approaches are effective in addressing data privacy requirements in queries. However, they are not suitable for scenarios requiring data transparency and truthfulness.

The genetic algorithm (GA), as discussed in previous research [38,39,40], is an algorithmic approach that involves a stochastic search mechanism based on the principles of natural competition and selection [41,42,43]. By utilizing a population model, GA is able to maintain a diverse search direction and facilitate the production of high-quality solutions. The widespread use of GA in various optimization problems [44,45,46,47] can be attributed to its advantages in high search efficiency and robustness.

This paper presents the information-driven distributed genetic algorithm (ID-DGA). The proposed algorithm optimizes anonymization solutions using a combination of attribute generalization and record suppression techniques. ID-DGA is designed based on a distributed population model to improve population diversity. Besides, ID-DGA incorporates a specifically designed information-driven crossover operator that facilitates the exchange of information between anonymization solutions and promotes information release. In addition, ID-DGA employs an information-driven mutation operator to enhance population diversity and information release. Furthermore, the proposed information-driven improvement operator helps adaptively refine the anonymization solutions. Finally, a two-dimensional selection operator is introduced to enhance individual competitiveness and population quality.

The paper is structured as follows. Section 2 provides an overview of the related work in the field of PPDP. Section 3 formally defines the PPDP problem. Section 4 presents the proposed ID-DGA in detail. Sections 5 and 6 outline the experimental setup used in this study and present an analysis of the experimental results. Finally, Section 7 offers concluding remarks to wrap up the paper.

2 Related work

In [16], a survey regarding PPDP was presented. In this survey, the related techniques of PPDP were systematically summarized. These techniques were designed according to four attack models, i.e., record linkage, attribute linkage, table linkage, and probabilistic attack. Moreover, the anonymization operations and information metrics were introduced. The previous privacy models can be divided into two categories in terms of mechanism. The first category is based on decreasing the precision of the original dataset to achieve the given specific privacy criteria, including k-anonymity, l-diversity, and t-closeness. The second category is designed based on perturbation to guarantee that no significant difference is shown in the query results when inserting one record.

For the first category, various approaches have been proposed. One of the most important approaches was proposed in [32], where a binary search was performed on the generalization lattice for the solution. Afterward, the optimal k-anonymity problem was proven to be an NP-hard problem [48]. In [49], an algorithm named Incognito performs a bottom-up, breadth-first search of the generalization lattice. In [50], an algorithm named optimal lattice anonymization was proposed. In this algorithm, the generalization lattice was divided into several sub-lattices, and the optimal solution was found by searching within each sub-lattice. In [33], a generic framework for optimal k-anonymity was presented. Based on the proposed framework, an algorithm named Flash was developed to perform the search for the optimal node in the lattice on each built path. In [51], the authors presented an algorithm for k-anonymization of time-varying datasets. Based on micro-aggregation, such an algorithm can support adding, deleting, and updating records while keeping its k-anonymity property. Authors in [52] tackled the semantic attack in trajectory data publishing. An algorithm providing privacy protection against semantic and re-identification attacks was proposed. In [34], a special case of dataset called identical generalization hierarchy was considered, whose solution is effective to address the general IoT data privacy protection. Accordingly, an algorithm for the globally optimized k-anonymity solution was designed.

For the second category, differential privacy [36, 37] was proposed. Differential privacy focuses on data privacy in queries. In differential privacy, any two datasets that differ in one record should return similar results to the same query. In [53], a variant of differential privacy named local differential privacy was tackled. Accordingly, a local differentially private high-dimension data publication algorithm was designed based on distribution estimation. In [54], a compressed sensing mechanism was proposed for differential privacy based on the compressed sensing framework while guaranteeing the accuracy of query results. However, differential privacy approaches are not applicable in PPDP scenarios that require data transparency, since the noise they introduce cannot guarantee data truthfulness.

3 Problem definition

For the data publisher, the objective of PPDP is to transform the original dataset D into an anonymous dataset T that satisfies the given privacy requirement determined by a privacy model while keeping its utility as high as possible.

In D, quasi-identifiers (QIDs) are attributes that could potentially identify the owners of records in the dataset. During the anonymization, various anonymization operations such as generalization and suppression can be applied to the QID attributes, transforming QID into QID’ in T.

In our definition, the k-anonymity criterion is set as the privacy model and defined as:

Definition 1

(k-anonymity) A dataset satisfies the k-anonymity requirement if each combination of QID’ attributes exists in at least k records.

Accordingly, the anonymity degree (AD) value of a k-anonymity T equals k. The objective of PPDP is to identify the optimal anonymization and is defined as:

Definition 2

(Optimal anonymization) For T, an optimal anonymization solution can satisfy the privacy requirement (\(\text {AD}(T)\ge k\)) and achieves the highest utility degree.
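Definitions 1 and 2 can be illustrated with a minimal sketch. It assumes that AD(T) is the size of the smallest group of records sharing the same QID’ values, which is one common reading of k-anonymity; the function names and the sample table are hypothetical and are not taken from the paper.

```python
from collections import Counter


def anonymity_degree(records):
    """Size of the smallest group of records that share identical QID' values.

    Returning 0 for an empty table is a convention chosen for this sketch.
    """
    if not records:
        return 0
    group_sizes = Counter(tuple(r) for r in records)
    return min(group_sizes.values())


def satisfies_k_anonymity(records, k):
    """True if every QID' combination appears in at least k records."""
    return anonymity_degree(records) >= k


# Hypothetical generalized table with two QID' attributes.
table = [
    ("2*", "F"),
    ("2*", "F"),
    ("3*", "M"),
    ("3*", "M"),
    ("3*", "M"),
]
```

Under this assumption, the sample table has an anonymity degree of 2, so it meets the requirement for k = 2 but not for k = 3.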

The utility of T is calculated according to its transparency degree (TD) [16]:

$$\begin{aligned} \text {TD}(T)=\sum _{r\in T}{\text {TD}(r)} \end{aligned}$$
(1)
$$\begin{aligned} \text {TD}(r)=\sum _{v_g\in r}{\text {TD}(v_g)} \end{aligned}$$
(2)

where r indicates the record in T; \(v_g\) is the generalized value in record r. TD value of \(v_g\) is calculated as:

$$\begin{aligned} \text {TD}(v_g)=\frac{1}{\left| v_g \right| } \end{aligned}$$
(3)

where \(\left| v_g \right| \) is the number of domain values that are descendants of \(v_g\).
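Equations (1) to (3) can be sketched as follows. Each generalized value is represented here only by its descendant count \(\left| v_g \right| \), and an ungeneralized value is assumed to have a count of 1; both choices, as well as the function names, are assumptions made for illustration.

```python
def td_value(num_descendants):
    """Eq. (3): TD of one generalized value, given |v_g|."""
    return 1.0 / num_descendants


def td_record(record):
    """Eq. (2): sum of TD over the generalized values of one record."""
    return sum(td_value(n) for n in record)


def td_table(table):
    """Eq. (1): sum of TD over all records in T."""
    return sum(td_record(r) for r in table)


# Hypothetical table: each entry is |v_g| for one QID' value.
# A count of 1 is assumed to mean "not generalized".
table = [
    [1, 2, 4],
    [1, 2, 4],
]
```

For this sample, each record contributes 1 + 1/2 + 1/4 = 1.75, so the table has a TD of 3.5. A suppressed record contributes nothing because it is absent from T.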

Fig. 1 Illustration of representation in ID-DGA, where a sample dataset containing three QID attributes and four records is given

4 ID-DGA

This section presents an overview of the proposed ID-DGA. Firstly, we present the distributed population model utilized in ID-DGA. Afterward, we discuss the representation of individuals in ID-DGA (Figure 1). Subsequently, the information-driven crossover, mutation, and improvement strategies employed by ID-DGA are illustrated in detail. Additionally, we introduce the two-dimensional selection operator utilized in ID-DGA. Finally, the entire procedure of ID-DGA is illustrated to provide a comprehensive understanding of the algorithm.

Fig. 2 An example of the distributed population model in ID-DGA

4.1 Distributed population model

In the distributed population model, the entire population of ID-DGA is divided into several sub-populations, and each sub-population evolves independently. All the sub-populations communicate according to the predefined topology. With the help of the communication topology, sub-populations share their elite individuals with a given interval, which is referred to as the migration operator. Once one sub-population receives the migrated elite individuals, individuals in the current generation are randomly selected and replaced.

In our proposed ID-DGA, the distributed population model utilizes a ring communication topology. An example of the distributed population model is given in Figure 2. As shown in the example, each big circle indicates a sub-population. In the big circles, small triangles and circles represent the best individuals and the other sub-population individuals, respectively. The best individuals in sub-populations are sent to the neighboring sub-populations on the communication topology at the predefined migration interval. Afterward, one individual in each sub-population is chosen at random and replaced by the received elite individual.

By dividing the entire population of the ID-DGA into several sub-populations with independent evolution, the distributed population model can help the ID-DGA improve the population diversity. By migrating elite individuals in the sub-populations, the island model can enhance the ID-DGA’s population quality. If the migration operator is appropriately executed, the ID-DGA can achieve the trade-off between exploration and exploitation. Moreover, since each sub-population evolves independently, the island model can be directly implemented in a distributed manner, which is crucial for speedup in evolution.
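The migration step on a ring topology can be sketched as below. The sketch assumes a unidirectional ring in which sub-population i sends its best individual to sub-population i + 1, and it protects the receiver's own best individual from replacement, as described for the slave nodes in Section 4.7. The function name and the scalar `fitness` callable are hypothetical simplifications.

```python
import random


def migrate_ring(subpops, fitness, rng):
    """One migration step: each sub-population sends its elite to the next one.

    subpops: list of lists of individuals (modified in place).
    fitness: callable returning a comparable score (higher is better).
    rng:     a random.Random instance.
    """
    # Collect all elites first so that migrants are not forwarded twice.
    elites = [max(pop, key=fitness) for pop in subpops]
    for i, elite in enumerate(elites):
        target = subpops[(i + 1) % len(subpops)]
        best_idx = max(range(len(target)), key=lambda j: fitness(target[j]))
        # Replace a random individual other than the receiver's own best.
        slots = [j for j in range(len(target)) if j != best_idx] or [best_idx]
        target[rng.choice(slots)] = elite


# Hypothetical usage with integers as stand-in individuals.
subpops = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
migrate_ring(subpops, fitness=lambda x: x, rng=random.Random(0))
```

After the call, every sub-population still holds its own best individual and additionally holds the elite of its predecessor on the ring.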

4.2 Representation

Figure 1 depicts a sample dataset and its anonymization solution. The dataset consists of four records and three quasi-identifier (QID) attributes. The anonymization solution comprises two vectors: a vector for attribute generalization denoted by “G” and a vector for record suppression denoted by “S”. In vector G, each QID attribute is generalized based on its level, while in vector S, each record is suppressed based on its corresponding value. Specifically, a value of “0” indicates that the record is removed, whereas a value of “1” indicates that the record is retained.

In ID-DGA, an individual represents an anonymization solution, which includes two vectors – vector G and vector S. The length of vector G corresponds to the number of QID attributes, while the length of vector S corresponds to the number of records. Throughout the update process of individuals in ID-DGA, the competitiveness of anonymization solutions is enhanced.
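The two-vector representation can be sketched as a small data structure together with a function that applies it to a dataset. The generalization hierarchies below (decade-based ages and masked ZIP codes) and all names are hypothetical; the paper does not specify the hierarchies in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Individual:
    G: List[int]  # generalization level for each QID attribute
    S: List[int]  # 1 = record retained, 0 = record suppressed


def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact, 1 = decade, 2+ = fully generalized."""
    if level == 0:
        return str(age)
    if level == 1:
        return f"{age // 10}*"
    return "*"


def generalize_zip(zipcode, level):
    """Hypothetical hierarchy: mask the last `level` characters."""
    keep = max(len(zipcode) - level, 0)
    return zipcode[:keep] + "*" * (len(zipcode) - keep)


def apply_solution(records, hierarchies, ind):
    """Return the anonymized table T produced by an individual."""
    anonymized = []
    for record, keep in zip(records, ind.S):
        if keep == 1:
            anonymized.append(
                tuple(h(v, lvl) for v, h, lvl in zip(record, hierarchies, ind.G))
            )
    return anonymized


records = [(23, "12345"), (27, "12399"), (35, "54321"), (31, "54388")]
solution = Individual(G=[1, 2], S=[1, 1, 1, 0])
anonymized = apply_solution(records, [generalize_age, generalize_zip], solution)
```

Here the fourth record is suppressed, and the remaining three records are generalized to decade-level ages and ZIP codes with the last two digits masked.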

4.3 Information-driven crossover

The information-driven crossover operator involves the use of two distinct strategies. During the exchange of information between two individuals, the vectors G and S are subjected to separate information exchange strategies. The exchange process for vector G entails randomly selecting one of two possible values for each bit of the offspring, based on the corresponding bit in the parent individuals. On the other hand, vector S is subjected to OR gate rules, whereby each bit of the offspring takes the value of one if at least one of the corresponding bits in the parent individuals is one, and zero otherwise. It is worth noting that these strategies operate independently of each other, and they are specifically tailored to enhance the exchange of information between the parent individuals.

Fig. 3 Illustration of information-driven crossover operator, in which the information in two anonymization solutions is exchanged

Figure 3 provides an illustrative example of the crossover operator, which involves two parent individuals with two G vectors (\(\text {G}_1\) and \(\text {G}_2\)) and two S vectors (\(\text {S}_1\) and \(\text {S}_2\)). The crossover operator is executed separately on each pair of vectors, generating two offspring vectors (\(\text {G}_{1\times 2}\) and \(\text {S}_{1\times 2}\)). For \(\text {G}_{1\times 2}\), the value of each bit is randomly selected from the corresponding bits in \(\text {G}_1\) and \(\text {G}_2\). For instance, the candidates for the first bit in \(\text {G}_{1\times 2}\) are the first bit of \(\text {G}_1\) (i.e., 2) and the first bit of \(\text {G}_2\) (i.e., 1), and the value from \(\text {G}_1\) is randomly selected. Similarly, the values of the second and third bits in \(\text {G}_{1\times 2}\) are taken from \(\text {G}_1\) and \(\text {G}_2\), respectively. For \(\text {S}_{1\times 2}\), OR gate rules are applied: each bit of the offspring vector takes the value of one if at least one of the corresponding bits in the parent individuals is one, and zero otherwise. For example, the first, second, and third bits in \(\text {S}_{1\times 2}\) are all one, because at least one of the corresponding bits in \(\text {S}_1\) and \(\text {S}_2\) is one. Conversely, the fourth bit in \(\text {S}_{1\times 2}\) is zero, since both values in \(\text {S}_1\) and \(\text {S}_2\) are zero. The crossover processes for the G and S vectors are independent of each other.

Our proposed information-driven crossover operator facilitates the exchange of information between parent individuals. This operator randomly exchanges the values of two G vectors, leading to a mixture of generalization levels in the resulting anonymization solutions. If the parent individuals meet the privacy preservation requirement, it is likely that their offspring solution will also satisfy the same requirement. For two S vectors, their values are accumulated. As long as one record in one anonymization solution is released, the corresponding record in the offspring anonymization solution is released. Thus, more information is released in the offspring anonymization solution.
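The crossover described above can be sketched in a few lines: a per-position random pick for vector G and a bitwise OR for vector S. The function name and the parent vectors are hypothetical; only the first bits of the G vectors (2 and 1) and the resulting S pattern follow the example of Figure 3.

```python
import random


def information_driven_crossover(g1, s1, g2, s2, rng):
    """Produce one offspring (G, S) from two parents."""
    # G: each position inherits the level of one randomly chosen parent.
    g_child = [rng.choice((a, b)) for a, b in zip(g1, g2)]
    # S: a record is retained if at least one parent retains it (OR rule).
    s_child = [a | b for a, b in zip(s1, s2)]
    return g_child, s_child


# Hypothetical parents with three QID attributes and four records.
g1, s1 = [2, 1, 3], [1, 0, 1, 0]
g2, s2 = [1, 2, 2], [0, 1, 1, 0]
g_child, s_child = information_driven_crossover(g1, s1, g2, s2, random.Random(0))
```

Because of the OR rule, the offspring never retains fewer records than either parent, which matches the stated goal of releasing more information.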

Fig. 4 Illustration of information-driven mutation operator, in which two vectors in the solution are adjusted separately

4.4 Information-driven mutation

The information-driven mutation operator handles vectors G and S independently. In vector G, a single bit is randomly selected according to the predefined mutation rate MR, and its value is re-initialized within the range of its generalization levels. Vector S undergoes a similar mutation process: a random bit is selected according to the same mutation rate MR, and its value is set to one, which releases the corresponding record.

Figure 4 provides an illustration of the mutation operator in action. Specifically, the mutant versions of vectors \(\text {G}_1\) and \(\text {S}_1\) are denoted as \(\text {G}_{1*}\) and \(\text {S}_{1*}\), respectively. In the case of \(\text {G}_1\), a random selection is made for its third bit. The value of this bit is then changed from three to two, thereby altering the generalization level of the corresponding QID attribute. As for \(\text {S}_1\), the mutation process involves randomly selecting its fourth bit and changing its value from zero to one. Consequently, the fourth record in the mutated anonymization solution is disclosed.

Upon executing the proposed mutation operator, the anonymization solutions are adjusted in a random manner. Vector G undergoes changes in the generalization levels of randomly selected QID attributes. This process may lead to the creation of an anonymization solution that attains a higher degree of anonymity or transparency. In vector S, the records in the randomly chosen positions are released. It is likely to generate an anonymization solution that can achieve a higher transparency degree while satisfying the privacy requirement.
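A minimal sketch of the mutation follows. It assumes one particular reading of the description: with probability MR, one random position of G is redrawn uniformly from its allowed levels, and independently, with probability MR, one random position of S is set to one. The function name and the `max_levels` parameter (the highest level of each attribute's hierarchy) are hypothetical.

```python
import random


def information_driven_mutation(g, s, max_levels, mr, rng):
    """Return mutated copies of G and S; the inputs are left unchanged."""
    g_new, s_new = list(g), list(s)
    # G: with probability mr, redraw one generalization level within its range.
    if rng.random() < mr:
        i = rng.randrange(len(g_new))
        g_new[i] = rng.randint(0, max_levels[i])
    # S: with probability mr, release one record by setting its bit to one.
    if rng.random() < mr:
        j = rng.randrange(len(s_new))
        s_new[j] = 1
    return g_new, s_new


# Hypothetical individual and hierarchy depths.
g, s = [2, 1, 3], [1, 0, 1, 0]
max_levels = [3, 3, 3]
g_mut, s_mut = information_driven_mutation(g, s, max_levels, 1.0, random.Random(0))
```

The mutation of S only ever flips a bit from zero to one, so it can increase the number of released records but never reduce it.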

4.5 Information-driven improvement

The information-driven improvement operator is utilized to adaptively refine the child individual. More specifically, for each individual I whose AD value cannot satisfy the privacy preservation requirement, the information-driven improvement operator is utilized to improve its competitiveness. In I, the AD value of each record is calculated. Afterward, all the records with the lowest AD value are selected and the corresponding values in vector S are set as 0, meaning that these records are removed in the improved anonymization solution.

The proposed information-driven improvement operator is efficient in improving the AD values of individuals. At the beginning of the evolution, such an improvement operator is effective in improving the ratio of individuals that can satisfy the privacy preservation requirement. Afterward, such an improvement operator is helpful in transferring the ineligible individuals with high TD values to be eligible.
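The improvement step can be sketched as a single pass over the retained records. The sketch assumes that the AD value of a record is the size of its group of identical QID’ values among the retained records; the function name and inputs are hypothetical, and `generalized` is assumed to hold the QID’ tuple of every record after vector G has been applied.

```python
from collections import Counter


def information_driven_improvement(generalized, s, k):
    """Suppress all retained records whose group size is the lowest.

    generalized: QID' tuple of every record (retained or not).
    s:           suppression vector (1 = retained, 0 = suppressed).
    k:           required anonymity degree.
    """
    group_sizes = Counter(q for q, keep in zip(generalized, s) if keep == 1)
    if not group_sizes:
        return list(s)
    lowest = min(group_sizes.values())
    if lowest >= k:
        return list(s)  # already satisfies the privacy requirement
    return [
        0 if keep == 1 and group_sizes[q] == lowest else keep
        for q, keep in zip(generalized, s)
    ]


# Hypothetical generalized records: the third one is unique.
generalized = [("2*",), ("2*",), ("3*",), ("4*",), ("4*",)]
s_improved = information_driven_improvement(generalized, [1, 1, 1, 1, 1], k=2)
```

One pass does not always suffice: if several group sizes lie below k, only the smallest groups are removed, so an individual may remain ineligible after the call.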

4.6 Two-dimension selection

In evaluating the quality of anonymization solutions, two indicators, namely AD and TD, are employed. The optimal anonymization solution, as per the problem definition, is one that achieves the highest TD while also satisfying the requirement stipulated by AD. The prioritization of these two indicators should vary depending on the specific situation. To this end, three rules have been formulated.

  1. If neither of two individuals satisfies the privacy preservation requirement, the individual with a higher AD value is considered more competitive.

  2. If only one individual satisfies the privacy preservation requirement, that individual is deemed more competitive.

  3. If both individuals satisfy the privacy preservation requirement, the individual with a higher TD value is considered better.

Figure 5 displays three pairs of individuals, each evaluated based on the three comparison rules. In every pair, the individual represented by a circle is deemed more competitive than the one represented by a triangle. The first rule is applied in the first pair. Although neither individual can satisfy the privacy protection requirement, the circle individual has a higher AD value. Therefore, the circle individual is deemed more competitive. In the second pair, the circle individual is superior as it fulfills the privacy protection requirement. Finally, in the third pair, both individuals can satisfy the privacy protection requirement, but the circle individual still prevails due to its higher TD value.

Fig. 5 Illustration of two-dimension selection operator, where three pairs of solutions are compared according to three defined rules

Such a two-dimension selection operator can effectively improve the population quality in ID-DGA. When no individual in the population can reach the privacy protection requirement, the individuals with higher privacy degrees are kept in the population. Thus, the entire population can approach the privacy protection requirement during the update. When part of the individuals in the population can reach the privacy protection requirement, these individuals are kept in the population. Finally, when most of the population can reach the privacy protection requirement, the individuals with higher TD values are kept in the population to improve the population quality. The implementation of this two-dimensional selection operator is therefore an effective approach to improving population quality in ID-DGA.
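The three comparison rules can be written as a single predicate. In this sketch, each individual is reduced to a hypothetical `(AD, TD)` pair, and ties are treated as "not better", which is an assumption since the rules do not address ties.

```python
def is_more_competitive(a, b, k):
    """Return True if individual a beats individual b.

    a, b: (AD, TD) pairs; k: required anonymity degree.
    """
    ad_a, td_a = a
    ad_b, td_b = b
    ok_a, ok_b = ad_a >= k, ad_b >= k
    if not ok_a and not ok_b:
        return ad_a > ad_b  # Rule 1: neither is eligible, higher AD wins
    if ok_a != ok_b:
        return ok_a         # Rule 2: the only eligible individual wins
    return td_a > td_b      # Rule 3: both are eligible, higher TD wins
```

For example, with k = 3, an ineligible individual with AD = 2 beats one with AD = 1 regardless of TD, while an eligible individual with a low TD still beats any ineligible one.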

Algorithm 1 Pseudo-code of ID-DGA.

4.7 Overall procedure

The entire procedure of the ID-DGA is described in Algorithm 1. As shown in the pseudo-code, a master-slave model is utilized to implement the ID-DGA. At the master node, the generation index g is set as zero. Then the entire population is divided into NSP sub-populations and sent to the corresponding NSP slave nodes. With the predefined migration interval MI, the master node receives the elite individuals from all the slave nodes. Then it sends these elite individuals to the corresponding slave nodes according to the ring topology. The migration process is executed until the terminal condition is satisfied. Finally, the best anonymization solution to the given dataset is outputted.

At the slave node, each sub-population evolves independently. During the evolution, in each generation, for each pair of parent individuals, the information-driven crossover operator is executed to exchange the anonymization information in parent individuals and generate the child individual. Afterward, the information-driven mutation operator is carried out on the child individual to improve the population diversity. After the mutation operator, if the mutant individual cannot satisfy the privacy preservation requirement, the mutant child individual is adjusted by the information-driven improvement operator. Subsequently, the child individual is evaluated and compared with the parent individuals by the selection operator. If the child individual is better than any parent individual, one of the parent individuals will be replaced. Otherwise, the mutant child individual will not be kept in the population. Then, the migration operator is carried out with the predefined migration interval MI. Each slave node sends the best individual to the master node and receives one elite individual from the master node. Afterward, one randomly chosen individual in the sub-population that is not the best individual will be replaced by the received migrated individual. Finally, the best individual is returned to the master node.
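The overall procedure can be sketched as a sequential simulation of the master-slave model. This is not Algorithm 1 itself: the operators are passed in as callables, parents are paired at random, the child replaces the first parent it beats, and a generation limit stands in for the fitness-evaluation budget. All of these choices and all names are assumptions made for illustration.

```python
import random


def best_index(pop, better):
    """Index of the most competitive individual under the `better` predicate."""
    idx = 0
    for j in range(1, len(pop)):
        if better(pop[j], pop[idx]):
            idx = j
    return idx


def evolve_one_generation(pop, ops, rng):
    """One generation on a slave node: crossover, mutation, improvement, selection."""
    for _ in range(len(pop) // 2):
        i, j = rng.sample(range(len(pop)), 2)  # random pairing is an assumption
        child = ops["mutate"](ops["crossover"](pop[i], pop[j]))
        if not ops["feasible"](child):
            child = ops["improve"](child)
        for idx in (i, j):
            if ops["better"](child, pop[idx]):
                pop[idx] = child  # replace the first parent that the child beats
                break


def id_dga_sketch(subpops, ops, mi, max_gen, rng):
    """Sequential stand-in for the master loop with ring migration every mi generations."""
    better = ops["better"]
    for g in range(1, max_gen + 1):
        for pop in subpops:  # each sub-population would run on its own slave node
            evolve_one_generation(pop, ops, rng)
        if g % mi == 0:
            elites = [pop[best_index(pop, better)] for pop in subpops]
            for i, elite in enumerate(elites):
                target = subpops[(i + 1) % len(subpops)]
                keep = best_index(target, better)
                slots = [n for n in range(len(target)) if n != keep] or [keep]
                target[rng.choice(slots)] = elite
    finalists = [pop[best_index(pop, better)] for pop in subpops]
    return finalists[best_index(finalists, better)]


# Toy usage: integers as stand-in individuals, larger is better.
toy_ops = {
    "crossover": lambda a, b: (a + b) // 2,
    "mutate": lambda x: x + 1,
    "feasible": lambda x: True,
    "improve": lambda x: x,
    "better": lambda a, b: a > b,
}
rng = random.Random(1)
toy_subpops = [[rng.randint(0, 10) for _ in range(4)] for _ in range(3)]
toy_start = max(max(p) for p in toy_subpops)
toy_result = id_dga_sketch(toy_subpops, toy_ops, mi=5, max_gen=20, rng=rng)
```

In a real setting, the entries of `toy_ops` would be replaced by the information-driven operators and the two-dimension comparison, and each inner loop iteration over `subpops` would run on a separate compute node.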

Table 1 Properties of 16 test instances

5 Experimental setup

This section describes the test instances, parameter settings, and algorithm implementation used in the following experiments.

5.1 Test instances

In the subsequent experimental studies, 16 test instances are utilized to investigate the performance of the proposed ID-DGA. These test instances are generated based on the public datasets released by the New York State Department of Health (footnote 1). Table 1 outlines the properties of these test instances, including the number of attributes nA, the number of QID attributes nQID, and the number of records nR. In addition, in each test instance, the privacy requirement of anonymity degree k is set as 2.

5.2 Parameter settings

In the proposed ID-DGA, population size N is set as 40 and number of sub-populations NSP is set as 4; mutation rate MR is set as 0.1; migration interval MI is set as 5. For all the algorithms, the maximum fitness evaluation number is set as \(nQID\times nR\).
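The settings above can be collected in a small configuration object. The sketch assumes that N is the total population, split evenly across the sub-populations; the class, field, and function names are hypothetical, and the instance size used in the usage line is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IDDGASettings:
    population_size: int = 40      # N
    num_subpopulations: int = 4    # NSP
    mutation_rate: float = 0.1     # MR
    migration_interval: int = 5    # MI

    @property
    def subpopulation_size(self):
        """Individuals per sub-population, assuming an even split of N."""
        return self.population_size // self.num_subpopulations


def max_fitness_evaluations(n_qid, n_records):
    """Evaluation budget nQID x nR used for all algorithms."""
    return n_qid * n_records


settings = IDDGASettings()
budget = max_fitness_evaluations(n_qid=5, n_records=1000)  # hypothetical instance
```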

5.3 Algorithm implementation

ID-DGA and all the compared algorithms in this paper are implemented in C++ and performed on a local compute node (OS: Ubuntu 16.04; CPU: 16-Core Intel i9-12900K; Memory: 16GB).

6 Experimental result

In this section, we verify the advantages of the proposed ID-DGA by comparing it with the baseline algorithm GA, differential evolution (DE), a competitive optimal anonymization algorithm Flash, and an information-driven genetic algorithm (ID-GA). Moreover, the effect of all the proposed operators is investigated.

6.1 Comparison with existing approaches

To verify the effectiveness of the proposed ID-DGA algorithm, four existing algorithms, i.e., GA [38], DE [55], Flash [33], and ID-GA [56], are utilized for comparison. These four algorithms are described as follows:

  1. GA [38]: This algorithm acts as a baseline algorithm in the comparison. When compared with the proposed ID-DGA, the effect of our designed operators in ID-DGA is confirmed.

  2. DE [55]: In this algorithm, each privacy-preserving solution is represented by an individual of differential evolution (DE), and the competitiveness of the solution is improved through the mutation, crossover, and selection operators.

  3. Flash [33]: In this paper, a generic framework for globally-optimal k-anonymity was presented. Furthermore, an algorithm based on a binary search was proposed based on the proposed framework.

  4. ID-GA [56]: In this paper, an information-driven genetic algorithm was designed to achieve the optimal anonymization based on attribute generalization and record suppression.

Table 2 Comparison with existing approaches

In Table 2, the mean and standard deviation values of TD over 25 independent runs are presented, and the best results are highlighted in boldface. Overall, our proposed ID-DGA outperforms the compared existing algorithms on all 16 test instances. Compared with GA, the advantage of ID-DGA in information exchange is verified. With the help of the proposed information-driven crossover, mutation and improvement operators, individuals can effectively identify solutions that achieve higher TD values while meeting the requirement of privacy preservation. Compared with DE, the advantage of our proposed ID-DGA in discrete-domain optimization is verified. Since the privacy-preserving solutions are in the discrete domain, the crossover and mutation operators in ID-DGA outperform the corresponding operators in DE because they achieve higher efficiency in information exchange in the discrete domain. Besides, ID-DGA is more likely to produce eligible privacy-preserving solutions. Compared with Flash, the advantage of ID-DGA in search efficiency is verified. As the number of attributes increases, the complexity of the optimization problem increases rapidly. In this situation, the individuals in ID-DGA are more likely to maintain diversity and identify more competitive solutions. Compared with ID-GA, the advantage of the distributed framework in population diversity preservation is verified. Thus, ID-DGA achieves a better balance between exploration and exploitation.

Moreover, to investigate the advantage of ID-DGA in a statistical sense, the Wilcoxon rank-sum test with a 0.05 significance level is utilized. In Table 2, the symbol \(^\dagger \) shows that the corresponding result is significantly better than the compared results. Overall, ID-DGA obtains significantly better results on all 16 test instances.
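The significance check can be sketched with a two-sided Wilcoxon rank-sum test based on the normal approximation. The sketch uses mid-ranks for ties without a tie correction of the variance, and the sample values are made up for illustration; they are not results from Table 2, and the function name is hypothetical.

```python
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive mid-ranks; no tie correction is applied.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = mid_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean) / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up TD values from five runs of two hypothetical algorithms.
runs_a = [105.2, 104.8, 106.1, 105.5, 104.9]
runs_b = [101.3, 100.7, 102.0, 101.1, 100.9]
z, p = rank_sum_test(runs_a, runs_b)
significant = p < 0.05
```

A positive z with p below 0.05 would indicate that the first sample tends to take larger values than the second at the chosen significance level. With 25 runs per algorithm, as in the experiments, the normal approximation is more reliable than with the five-value samples shown here.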

Fig. 6 Convergence curves of ID-DGA and compared algorithms on six typical test instances

In Figure 6, the convergence curves of ID-DGA and the compared existing algorithms are plotted. In this figure, the algorithms are indicated by symbols with different colors. For each point, the value on the horizontal axis represents the number of fitness evaluations, and the value on the vertical axis represents the TD value. Compared with GA, the advantage of ID-DGA in search efficiency is verified. Due to the proposed information-driven crossover, mutation and improvement operators, ID-DGA achieves a higher convergence speed during the entire process. Compared with DE, ID-DGA has an advantage in discrete-domain optimization and eligible solution identification. At the beginning of the search, ID-DGA outperforms DE due to the population diversity provided by the multi-population model and its efficiency in identifying eligible discrete solutions. Afterward, the advantage of ID-DGA in solution refinement is verified, as it achieves a higher convergence speed than DE. Compared with Flash, ID-DGA shows its advantage in population diversity and continuous search ability. Although the heuristic strategy in Flash achieves quick convergence at the beginning of the optimization, it is then trapped in local optima due to its limited diversity. Compared with ID-GA, ID-DGA achieves a higher convergence speed since the distributed population model of ID-DGA helps improve the population diversity. Overall, our proposed ID-DGA achieves the highest convergence speed during the entire process on all six typical test instances.

Table 3 Comparison with existing approaches on query precision (%)

Moreover, in Table 3, we compare the query precision produced by the existing approaches and our proposed ID-DGA. According to Table 3, we can see that in all the test instances, ID-DGA can achieve the highest query precision. The query precision depends on both the completeness of records and attributes. Since these two factors have been considered in the optimized TD metric, it is reasonable that ID-DGA outperforms the existing approaches.

Table 4 Impact of the proposed operators

6.2 Impact of the proposed operators

In this section, we investigate the impact of the proposed operators by comparing ID-DGA with three variants. These three variants are described as follows:

  1. without-framework: In this variant, the distributed population model is removed from ID-DGA. Accordingly, a single population is utilized.

  2. without-mutation: In this variant, the information-driven mutation operator is replaced by the traditional mutation operator in GA.

  3. without-improvement: In this variant, the information-driven improvement operator is removed from the complete ID-DGA.

In Table 4, the average and standard deviation values of TD over 25 independent runs are calculated and listed. The best results in these test instances are marked in boldface. Overall, the original ID-DGA can outperform the compared three variants on all 16 test instances. Compared with the without-framework variant, the advantage of the distributed population model in the ID-DGA is confirmed, which can effectively improve the population diversity and achieve a better balance between exploration and exploitation. Compared with the without-mutation variant, the advantage of the proposed information-driven mutation operator is verified, improving the information release while enhancing the population diversity. Compared with the without-improvement variant, the advantage of the information-driven improvement operator is verified, which can adaptively adjust the anonymization solutions according to the given information and accordingly improve the competitiveness of the anonymization solutions. The complete ID-DGA can outperform the compared three variants since the distributed population model and three information-driven operators are effective during the optimization.

In addition, the Wilcoxon rank-sum test at the 0.05 significance level is applied. The symbol \(^\dagger \) indicates that the corresponding result is significantly better than the compared results. In all 16 test instances, the advantage of the complete ID-DGA is significant.
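The significance check described above can be illustrated with a small sketch. It is not the authors' evaluation script: the function name `rank_sum_test` and the TD samples are hypothetical, and the test uses the large-sample normal approximation without a tie correction. In practice a library routine such as SciPy's `scipy.stats.ranksums` would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (z, p), where z < 0 means the values in x tend to be smaller.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)

    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j] and give each the average rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p


# Hypothetical TD values (lower is better) from repeated runs of two algorithms.
td_complete = [10.1, 10.3, 10.2, 10.0, 10.4]
td_variant = [11.0, 11.2, 10.9, 11.1, 11.3]

z, p = rank_sum_test(td_complete, td_variant)
print("significant at 0.05:", p < 0.05)
```

With 25 runs per algorithm, as in Table 4, the normal approximation is more reasonable than it is for the five-value samples used here for brevity.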

Fig. 7 Speedup ratios of ID-DGA on all test cases

6.3 Speedup ratio

The speedup ratio is an important metric for distributed algorithms because it reflects their computational efficiency. It is obtained by dividing the sequential algorithm's running time by the distributed algorithm's running time. A distributed algorithm with a higher speedup ratio achieves higher distributed computation efficiency, which is vital for the algorithm's scalability. The speedup ratio is therefore an important indicator when evaluating the performance of a distributed algorithm.
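As a minimal sketch of this metric, the snippet below computes the speedup ratio using the conventional definition (sequential running time divided by distributed running time). The function name `speedup_ratio` and the timings are hypothetical and are not measurements from the paper.

```python
def speedup_ratio(sequential_time, distributed_time):
    """Speedup = sequential running time / distributed running time."""
    if sequential_time <= 0 or distributed_time <= 0:
        raise ValueError("running times must be positive")
    return sequential_time / distributed_time


# Hypothetical running times in seconds, keyed by the number of sub-populations.
running_time = {1: 120.0, 2: 66.0, 4: 37.5, 8: 24.0, 16: 16.0}

sequential = running_time[1]  # the single sub-population run is the baseline
for count, elapsed in running_time.items():
    print(count, round(speedup_ratio(sequential, elapsed), 2))
```

A ratio above 1 means the distributed run finished faster than the sequential baseline.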

In the proposed ID-DGA, each sub-population is allocated to a single compute core, and each sub-population evolves independently. Thus, the number of sub-populations in ID-DGA directly reflects its parallel granularity. The running time of ID-DGA is measured with different numbers of sub-populations (1, 2, 4, 8, 16). The ID-DGA with a single sub-population is regarded as the sequential algorithm, and the ID-DGAs with multiple sub-populations are regarded as the distributed algorithms.

Figure 7 plots the speedup ratios of ID-DGA on the 16 test cases. The speedup ratios increase significantly as the parallel granularity of ID-DGA increases from two to sixteen. The speedup curves vary across test cases because the test cases differ in complexity and therefore in evaluation time. In general, the communication time of ID-DGA does not differ significantly across test cases. Thus, test cases with higher evaluation time, such as \(T_{7}\) and \(T_{10}\), can achieve higher speedup ratios. When applied to practical optimization problems, which have higher complexity, the proposed ID-DGA can further show its scalability and speed advantages.
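The link between evaluation time and speedup can be illustrated with a deliberately simplified cost model. This is an assumption made for illustration, not the paper's own analysis: the total evaluation time is taken to divide evenly across the sub-populations, the communication time is a fixed overhead, and the sequential run incurs no communication. The name `modeled_speedup` and all numbers are hypothetical.

```python
def modeled_speedup(evaluation_time, communication_time, workers):
    """Speedup under an assumed model: T(workers) = evaluation_time / workers + communication_time."""
    distributed_time = evaluation_time / workers + communication_time
    return evaluation_time / distributed_time


# Same fixed communication overhead, two hypothetical evaluation workloads.
communication = 5.0
for evaluation in (100.0, 1000.0):
    ratios = [round(modeled_speedup(evaluation, communication, w), 2) for w in (2, 4, 8, 16)]
    print(evaluation, ratios)
```

Under this model the fixed overhead takes up a smaller share of the distributed running time when the evaluation workload is larger, so the more expensive test case comes closer to the ideal speedup at every granularity.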

Table 5 Running time comparison between ID-GA and ID-DGA with communication consumption of ID-DGA

6.4 Communication consumption analysis

Table 5 analyzes the communication cost of the proposed ID-DGA using four metrics: running time (RT), communication time (CT), speedup ratio (SR), and communication ratio (CR). First, we compare the RT of ID-GA and ID-DGA. ID-DGA shows a dramatic reduction in RT compared with ID-GA. Accordingly, the SR values of ID-DGA in all test instances are between 2.5 and 4.0, verifying that the distributed model in ID-DGA effectively improves its running speed. Second, we analyze the CT and CR values of ID-DGA. The communication consumption of the proposed ID-DGA does not exceed 10%, verifying that the communication cost does not significantly affect the speed of ID-DGA. Overall, the advantages of ID-DGA in speedup and low communication cost are verified in this section.
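The communication ratio can be sketched as follows, assuming CR is the share of the distributed running time spent on communication (CR = CT / RT). The text does not state this formula explicitly, so the definition, the function name `communication_ratio`, and the timings below are illustrative assumptions rather than values from Table 5.

```python
def communication_ratio(communication_time, running_time):
    """Assumed definition: fraction of the running time spent communicating."""
    if running_time <= 0:
        raise ValueError("running time must be positive")
    if not 0 <= communication_time <= running_time:
        raise ValueError("communication time must lie within the running time")
    return communication_time / running_time


# Hypothetical measurements in seconds for one test instance.
rt_distributed = 40.0
ct_distributed = 2.0

cr = communication_ratio(ct_distributed, rt_distributed)
print(f"CR = {cr:.1%}, within the 10% bound: {cr <= 0.10}")
```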

7 Conclusion

This paper presents an information-driven distributed genetic algorithm to achieve optimal anonymization through attribute generalization and record suppression. The proposed algorithm introduces an information-driven crossover operator for exchanging information between anonymization solutions, and an information-driven mutation operator to promote information release in mutant anonymization solutions. Furthermore, an information-driven improvement operator is proposed to adaptively refine anonymization solutions. To enhance population diversity, the proposed algorithm integrates a distributed population model. Additionally, a two-dimensional selection operator is designed to evaluate the competitiveness of different anonymization solutions. The effectiveness of all the proposed components has been verified through experiments, demonstrating the superiority of the proposed algorithm in both solution accuracy and convergence speed.