1 Introduction

The modern healthcare ecosystem is increasingly shaped by the integration of diverse data sources, which introduces privacy challenges because patient data contain sensitive information [1, 2]. Healthcare data analysis has evolved significantly, together with research on preserving patient privacy [3, 4]. Privacy preservation is particularly crucial in healthcare, given the sensitive nature of medical data. One of the most popular and widely used privacy preservation techniques, k-anonymity, has long been a focus of research as a solution to such challenges [5]. State-of-the-art research in privacy-preserving techniques continues to analyze k-anonymity principles in diverse applications, from machine learning to cloud services [6, 7, 8].

Despite the critical importance of privacy preservation in safeguarding sensitive medical data, there is a notable lack of learning and training materials for privacy preservation. This becomes even more pressing considering that privacy regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) impose strict requirements on sensitive data, mandating that organizations implement robust measures to ensure data privacy. Therefore, there is a clear need for accessible, practical training materials that not only explain the legal and theoretical aspects of these privacy regulations, but also provide hands-on examples and scenarios for applying techniques like k-anonymity in real-world situations. Such resources would greatly enhance the understanding and application of privacy preservation principles, thereby contributing to more secure and compliant data handling practices.

1.1 Contribution

This research provides a strong foundation for understanding the complexities and trade-offs related to privacy preservation. More specifically, this paper presents the development of a k-anonymity algorithm intended for educational purposes, enabling learners to practice and understand k-anonymity concepts such as data generalization and suppression, as well as to explore the impact of k-anonymity parameters on information loss. By offering a simplified version of the k-anonymity algorithm, complete with validation tests and an example dataset, the research facilitates hands-on practice in the field of privacy preservation. Finally, this paper briefly presents five learning scenarios that can be used for acquiring technical skills and as guidelines to prepare a series of courses on the topic. The contributions of this work are open source, allowing individuals to comprehend the intricacies of the algorithm, follow the validation tests, experiment with the given dataset, and apply k-anonymity to create anonymized datasets. Thus, the paper serves as an accessible resource for those seeking to include k-anonymity topics in an educational context, or for individuals who want to understand and become familiar with k-anonymity. The intended audience includes educators, privacy engineers, healthcare security engineers, and GDPR practitioners. Overall, the contributions of the paper are as follows:

  • Implementation and development of k-anonymity: We have developed a simplified version of k-anonymity for educational purposes. The algorithm is published online [9] along with a dataset that is used for validation purposes and hands-on practice.

  • Validation tests: The seven validation tests assess the capabilities of the developed algorithm, examining the different functionalities that the k-anonymity implementation provides. The validation includes configuration changes to the anonymity levels and offers practical insights into the usage of the developed algorithm.

  • Learning scenarios: Five learning scenarios are provided in the online repository [9] and briefly described in the paper. The learning scenarios are based on the developed k-anonymity approach and use the validation tests to exercise the developed functionalities.

1.2 Paper structure

The rest of the paper unfolds as follows. Section 2 reviews the related work, while Sect. 3 presents the methodology and building blocks of the research, including the development of a mathematical model for k-anonymity and the selection of a dataset, among others. Section 4 provides the implementation of the k-anonymity technique and the algorithmic functions of the approach. Section 5 includes the validation tests conducted to evaluate the effectiveness and efficiency of the implementation. Section 6 presents the learning scenarios and comprehensive explanations for each of them. Finally, Sect. 7 concludes the paper by summarizing the findings and discussing future work.

2 Related work

Various aspects of privacy preservation in healthcare have been studied in the related work. Ren et al. focused on privacy-enhancing techniques in the Internet of Things (IoT) and the role of data anonymization in addressing privacy concerns within IoT ecosystems [10], while Dimopoulou et al. conducted research on the challenges of securing health information in mobile environments [11]. Louassef et al. provided a new taxonomy of privacy preservation techniques in healthcare systems, also noting the need for a better understanding of the techniques [12]. Similarly, Vovk et al. provided an extended literature review on methods and tools for the anonymization of healthcare data, offering a comprehensive overview of the current state of anonymization techniques [13].

K-anonymity is a widely established method, with ongoing research focusing on its application in various domains and on different datasets. Jain et al. proposed improvements to k-anonymity in the context of big data, offering insights into practical implementations and privacy improvements [14]. Sarcevic et al. conducted an analysis of the effectiveness of k-anonymity, providing a comprehensive understanding of the factors that influence its efficacy [15]. Jain et al. further contributed by proposing an enhanced secured MapReduce layer for big data processing [16].

Regarding the application of k-anonymity in healthcare, in a study conducted by Rajendran et al. [17], the authors thoroughly examined the application of k-anonymity, l-diversity, and t-closeness to medical data [18]. In a recent work, Asad et al. introduced the Secure Hierarchical Federated Learning (SHFL) framework, leveraging k-anonymity for enhanced security in federated learning within smart healthcare systems [19]. Next, Sangaiah et al. investigated entropy strategies for privacy preservation in healthcare [20]. In another work [10], the researchers addressed the challenges of data sharing and the importance of observing quasi-identifiers; more specifically, the authors introduced a Python library that implements various anonymization techniques for assessing the level of anonymity in a dataset. In addition, Mahesh et al. [21] proposed an anonymization technique that also preserves the utility of the published data. In other research, Abouelmehdi et al. focused on security and privacy challenges in big healthcare data [22]. Furthermore, Arava and Lingamgunta focused on cloud infrastructure by presenting an adaptive k-anonymity approach tailored for privacy preservation in cloud computing [23]. Finally, De Pascale et al. [24] enhanced the k-anonymity algorithm to avoid the substantial information loss incurred in big data during anonymization.

In addition to the aforementioned, data protection and privacy in healthcare face significant challenges. A main challenge is balancing data utility and privacy when publishing or sharing datasets between stakeholders [25]. Furthermore, the requirement for compliance with regulatory frameworks such as HIPAA or GDPR adds to the overall complexity of data protection, due to the need to fulfill privacy regulations while ensuring the effective utilization of datasets [26]. Other challenges stem from the extended interconnection of healthcare devices and sensors aimed at better diagnosis and patient monitoring [12]. Additionally, the heterogeneity of data sources, with variations in formatting and standards, introduces interoperability challenges for implementing consistent privacy measures [27]. It should be noted that re-identification attacks remain possible even after anonymization [28].

The above research shows that although a large body of work on various aspects of k-anonymity exists, to the best of our knowledge, no research specifically targets educational purposes, providing insights into k-anonymity concepts and how they can actually be applied to healthcare data. This research helps clarify the importance of parameter selection in the k-anonymity algorithm and illustrates how the parameters balance an intrinsic trade-off between information loss and privacy.

3 Methodology and building blocks

In this section, we present our selected privacy preservation algorithm, which is k-anonymity. The reason behind our choice is that k-anonymity is one of the most widely researched algorithms. Despite the emergence of new frameworks such as differential privacy, k-anonymity remains popular as shown in the related work, mainly due to its simplicity and straightforward implementation. Moreover, k-anonymity lays the groundwork for more sophisticated algorithms such as l-diversity and t-closeness, facilitating their development and understanding.

The main idea behind k-anonymity is to group records with similar attributes to avoid individual identification. The two main techniques to achieve this are data suppression and generalization. Data suppression removes sensitive or identifying information from the dataset, eliminating direct matching that could reveal individual identities; suppressing this information mitigates re-identification risks by breaking the direct link between anonymized records and the original individuals. Data generalization replaces identifying attributes with more generalized values. Both techniques are illustrated with a short code sketch after the list below. The k-anonymity data model includes the following attributes [5]:

  • Identifiers: Allow unique identification of a row in the dataset and should be suppressed to preserve anonymity (e.g., name, SSN).

  • Quasi-identifiers: Have the potential for linkage with other datasets. It is assumed that the entity possessing the data knows which attributes fall into this category. However, accurately determining these attributes can be challenging in practical scenarios, as it requires considering the entirety of available data in the broader context. Examples of common quasi-identifiers are dates (such as birth, death, admission, discharge, visit, and specimen collection), locations (such as postal codes, hospital names, and regions), race, ethnicity, languages spoken, and gender, among others.

  • Sensitive attributes: Need to be protected since they include private data (e.g., a disease). Typically, sensitive attributes are kept in the published data as they are the research target of the data analysis.
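To make suppression and generalization concrete, the following is a minimal sketch in Python, assuming records are plain dictionaries; the helper names (suppress, generalize_age) and the toy record are illustrative and not part of the published implementation [9].

```python
def suppress(record, identifiers):
    """Remove identifying attributes by replacing their values with '*'."""
    return {a: ("*" if a in identifiers else v) for a, v in record.items()}

def generalize_age(record, width=10):
    """Replace an exact age with the lower bound of its interval."""
    record = dict(record)
    record["Age"] = (record["Age"] // width) * width
    return record

patient = {"Name": "Jane Doe", "Age": 27, "HeartDisease": 1}
print(generalize_age(suppress(patient, {"Name"})))
# -> {'Name': '*', 'Age': 20, 'HeartDisease': 1}
```

Here the identifier (Name) is suppressed outright, while the quasi-identifier (Age) is coarsened into a 10-year interval, and the sensitive attribute (HeartDisease) is left untouched.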

Identifiers and quasi-identifiers are attributes in a dataset that could reveal sensitive or private information about an individual, such as medical conditions, income, or sexual orientation. In k-anonymity, private and sensitive data are protected by creating anonymous groups. This can be achieved through data generalization, i.e., by replacing a specific value in a dataset with a more general one. For example, if a dataset contains a person’s specific birthdate, generalization might replace it with an age range (e.g., 20–30). The idea is to remove any identifiable information, verified by conducting linkability tests, while still retaining the useful information in the dataset. The methodology followed to complete this research is presented in detail in Fig. 1.

Fig. 1: Methodology flow diagram. Each methodology step is depicted, followed by the mathematical model, the development of the k-anonymity algorithm, the validation tests, and the creation of the learning scenarios

To begin with, as depicted in Fig. 1, a comprehensive literature analysis was conducted to explore existing research on k-anonymity and its implementation. To formulate a theoretical basis and to develop the k-anonymity algorithm, a mathematical model was created. Furthermore, a healthcare dataset with practical applicability to heart-attack prediction was selected. The research then focused on deploying k-anonymity validation tests to verify that k-anonymity is achieved. In the final stage, the learning scenarios were created to bridge the gap between theoretical concepts and practical implementation, using the algorithm and validation tests as the foundational basis for exploration and understanding.

3.1 Mathematical model for k-anonymity

The dataset D is defined as a dataset with n records and m attributes, represented as a table with rows and columns. Rows correspond to individual records \(R_i\), columns correspond to attributes \(A_j\), and the values \(v_{i,j}\) hold the attributes’ content. Within the dataset, a set of quasi-identifiers QuasiID is identified, whose values collectively enable the identification of individuals or entities without revealing their full identities. Generalization plays a pivotal role in the pursuit of privacy preservation by transforming attribute values into higher-level, less granular representations. Each attribute \(A_j\) in a given record \(R_i\) undergoes generalization, resulting in a generalized value \(v'_{i,j}\) that enhances privacy.

$$\begin{aligned} D = \begin{bmatrix} & A_1 & A_2 & \ldots & A_m \\ R_1 & v_{1,1} & v_{1,2} & \ldots & v_{1,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n & v_{n,1} & v_{n,2} & \ldots & v_{n,m} \end{bmatrix} \end{aligned}$$

Each row \(R_i\) represents a record in the dataset, \(A_1, A_2, \ldots , A_m\) represent attributes, and \(v_{i,j}\) represents the value of attribute \(A_j\) in record \(R_i\). Let \(QuasiID \subseteq \{A_1, A_2, \ldots , A_m\}\) be the set of quasi-identifiers.

Generalization is the process of transforming attribute values to a higher-level representation to reduce granularity. For the attribute \(A_j\) in the record \(R_i\), generalization is represented as:

$$\begin{aligned} G(v_{i,j}) = v'_{i,j} \end{aligned}$$

where \(v'_{i,j}\) is the generalized value.

A critical aspect of k-anonymity is the formation of equivalence classes: records with indistinguishable generalized quasi-identifier values are grouped together, so each equivalence class delineates a set of rows with identical generalized quasi-identifier values. When the number of records within an equivalence class falls below the predefined k-anonymity threshold, data suppression is employed on the quasi-identifier values of those records, as shown in equation (1).

$$\begin{aligned} D[i,j] = *, \quad \text {for } j \in QuasiID, \ \text {if count}(c[i]) < k \end{aligned}$$
(1)

where c[i] denotes the equivalence class of the i-th record of the dataset. To quantify the extent of anonymization, the percentage of rows that have undergone suppression is calculated using equation (2): the number of rows where data suppression is applied is divided by the total number of rows in the dataset, n, and the result is multiplied by 100:

$$\begin{aligned} percentage = \frac{\text {rows with data suppression}}{\text {total rows in dataset}} \times 100 \end{aligned}$$
(2)

K-anonymity ensures that for each record \(R_i\), there exist at least \(k-1\) other records in the dataset with the same combination of quasi-identifier values. Equivalently, every record belongs to an equivalence class of at least k records sharing the same quasi-identifier values, as presented in equation (3).

$$\begin{aligned} \forall i: \quad \bigl |\{\, j : QuasiID_j = QuasiID_i \,\}\bigr | \ge k \end{aligned}$$
(3)

where \(QuasiID_i\) and \(QuasiID_j\) denote the quasi-identifier values of records \(R_i\) and \(R_j\), respectively. Records whose quasi-identifier values coincide form an equivalence class, and equation (3) requires every such class to contain at least k records.
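As an illustration of this property, the following is a minimal sketch, assuming records are Python dictionaries, that checks whether a dataset satisfies equation (3); the function name is hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """Check equation (3): every quasi-identifier combination occurs >= k times."""
    combos = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in combos.values())

rows = [{"Age": 40, "Sex": "M"}, {"Age": 40, "Sex": "M"}, {"Age": 50, "Sex": "F"}]
print(is_k_anonymous(rows, ["Age", "Sex"], k=2))  # False: (50, 'F') occurs only once
```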

Anonymization involves generalization and/or suppression of attributes to achieve K-Anonymity while minimizing information loss. Generalization \(G(v_{i,j})\) is applied to quasi-identifiers to create generalized values. Data Suppression \(S(v_{i,j})\) is applied to quasi-identifiers that cannot be generalized.

The utility of the anonymized dataset is assessed in terms of how well it preserves the original data’s statistical properties while protecting privacy. Information loss occurs due to generalization and suppression. Mathematically, the goal of K-Anonymity is to find an anonymized dataset \(D'\) such that:

$$\begin{aligned} D' = \begin{bmatrix} R_1 & QuasiID_1 & A_{s_1} & A_{g_1} & \ldots & A_{g_k} \\ R_2 & QuasiID_2 & A_{s_2} & A_{g_1} & \ldots & A_{g_k} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n & QuasiID_n & A_{s_n} & A_{g_1} & \ldots & A_{g_k} \end{bmatrix} \end{aligned}$$

Each row (\(R_i\)) represents a record in the dataset, and the attributes of each record are represented as columns: the quasi-identifiers QuasiID, the suppressed attributes \(A_s\), and the generalized attributes \(A_g\). k is the minimum required group size for k-anonymity.

The challenge in k-anonymity is to find the optimal generalization and suppression strategy that maximizes utility while satisfying the k-anonymity property and minimizing information loss.

3.2 Dataset

The dataset [29] contains information about individuals’ health-related attributes (a sample of the dataset is provided in Table 1). It includes attributes such as age, sex, chest pain type, resting blood pressure, cholesterol level, fasting blood sugar, resting electrocardiographic results, maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to rest, ST segment slope, and presence of heart disease [30].

In this research, the selection of quasi-identifiers is indicative, and the sensitive identifiable information is the HeartDisease attribute. Furthermore, the cholesterol and age values are substantiated by the dataset’s periodic blood test results, which depict patients’ values over time. The fluctuations in cholesterol, along with other attributes, allow for indirect identification of individuals within the dataset. This selection is supported by existing research recognizing cholesterol values as a quasi-identifier [31, 32]. Furthermore, it is crucial to highlight that the selection of quasi-identifiers, including cholesterol, plays a pivotal role in executing the learning scenarios. Taking this into account, the learning scenarios (see, for example, Learning Scenario 03 [9]) encourage practitioners to explore alternative attributes as quasi-identifiers.

Table 1 Sample rows from the dataset [30]

The dataset adheres to the Fast Healthcare Interoperability Resources (FHIR) standard, which is widely adopted for the electronic exchange of healthcare information. The standard supports modern RESTful APIs that facilitate easy integration, support lightweight communication, and enable the development of scalable and efficient healthcare applications [33]. FHIR is preferred over its predecessor, HL7 v3, due to the RESTful API it provides and its support for popular data interchange formats such as JSON, XML, and RDF. Furthermore, the modular design of FHIR reduces redundancy and complexity [34] and, coupled with its interoperability [35], has led to quick and widespread adoption in the healthcare industry.
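For readers who want to inspect the dataset before anonymizing it, a minimal sketch using Pandas is shown below; the file name heart.csv is an assumption, and the actual dataset is available via [29].

```python
import pandas as pd

# Hypothetical file name for the heart-disease dataset [29].
df = pd.read_csv("heart.csv")

print(df.columns.tolist())  # attribute names, e.g. Age, Sex, ..., HeartDisease
print(df[["Age", "Cholesterol", "HeartDisease"]].head())  # candidate quasi-identifiers and the sensitive attribute
```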

4 K-anonymity implementation

This section presents a straightforward implementation of k-anonymity developed in the Python programming language. For the educational purposes of this paper, optimization techniques are unnecessary and therefore not considered. Python is popular for its high portability and integration capabilities. It is frequently used in data science as it offers a wide range of libraries, including scikit-learn [36], NumPy [37], SciPy [38], and Pandas [39]. Moreover, Python can be used in combination with interactive environments such as Jupyter Notebooks [40], which offer direct experimentation with code blocks using only a browser. Furthermore, Python maintains a high level of adaptability, and its user-friendly nature makes it ideal for data scientists.

As shown below, Algorithm 1 provides an overview of the implementation, while the source code can be found in [9].

The algorithm takes as input a dataset D, a user-defined k-anonymity threshold k, a set of quasi-identifiers QuasiID, and a dictionary of generalization intervals G. It outputs an anonymized dataset \(D'\) and a percentage value percentage quantifying the degree of anonymization achieved. The code draws upon the work of Machanavajjhala et al. [41] and Shah et al. [42] to implement the abovementioned data anonymization techniques.

Algorithm 1: The k-anonymity algorithm

The developed algorithm initializes by determining the total number of records in the dataset, extracting the header row containing attribute names, and establishing the number of quasi-identifiers. It also initializes a dictionary to record the frequencies of unique combinations of quasi-identifier values and an accumulator to monitor the number of anonymized records.

Generalization is applied to specific attributes as defined in the supplied generalization dictionary. Recall that generalization entails the transformation of fine-grained attributes into coarser-grained intervals, thereby reducing the precision of the data. The algorithm proceeds to calculate the occurrences of each unique combination of quasi-identifier values within the dataset. This preparatory step identifies combinations that are less frequent and, consequently, more susceptible to privacy breaches.

For each record in the dataset, the algorithm assesses the combination of quasi-identifier values it embodies. In cases where the frequency of the combination falls below the user-specified k threshold, the algorithm anonymizes the corresponding quasi-identifying attributes by substituting specific details with symbols, typically asterisks (*). The count of anonymized records is incremented accordingly. Anonymization progress is quantified by computing the percentage of anonymized records relative to the total number of records. This metric reflects the extent to which privacy has been enhanced. The algorithm finalizes its execution by generating a new dataset \(D'\) that comprises the header row and the anonymized data.
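The following is a condensed Python sketch of the steps described above. It follows the structure of Algorithm 1 but is not the published implementation, whose exact interface may differ (see [9]); the function signature and default output file name are assumptions.

```python
import csv
from collections import Counter

def k_anonymize(path, k, quasi_ids, generalization=None, out_path="anonymized.csv"):
    """Sketch of Algorithm 1: generalize, count combinations, suppress below k."""
    generalization = generalization or {}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # attribute names
        rows = list(reader)     # total number of records = len(rows)

    col = {name: header.index(name) for name in set(quasi_ids) | set(generalization)}

    # Generalization: map fine-grained values to coarser intervals.
    for row in rows:
        for name, width in generalization.items():
            row[col[name]] = str((int(float(row[col[name]])) // width) * width)

    def combo(row):
        return tuple(row[col[name]] for name in quasi_ids)

    # Frequencies of each unique combination of quasi-identifier values.
    counts = Counter(combo(row) for row in rows)

    # Suppression: anonymize records whose equivalence class is smaller than k.
    suppressed = 0
    for row in rows:
        if counts[combo(row)] < k:
            for name in quasi_ids:
                row[col[name]] = "*"
            suppressed += 1

    percentage = 100.0 * suppressed / len(rows)  # equation (2)

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return percentage
```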

5 Validation tests and anonymization process

In this section, seven validation tests are presented which are used to explore the anonymization parameters of the implemented k-anonymity. It should be noted that in the validation tests all attributes can be considered quasi-identifiers, except for HeartDisease, which is considered a sensitive attribute. In the following tests, the selection of quasi-identifiers is indicative and is used to better depict the execution of the algorithm and support the learning scenarios. The validation tests are used as follows:

  • Validation Test \(t_1()\) provides a first execution of the algorithm using data suppression.

  • Validation Test \(t_2()\) and Validation Test \(t_3()\): These tests evaluate the basic functionality of the data anonymization, using both data suppression and data generalization. Validation Test \(t_3()\) extends the evaluation by introducing more attributes as quasi-identifiers. The objective is to confirm that the algorithm can apply different generalization options.

  • Validation Test \(t_4()\) and Validation Test \(t_5()\): These tests provide a visual representation of the impact of anonymity level (\(k-value\)). The goal is to understand how different levels of anonymity affect the outcome.

  • Validation Test \(t_6()\) and Validation Test \(t_7()\): Similar to the above, these tests provide a visual representation of the impact of the anonymity level (\(k-value\)). The tests show the impact of using data suppression alone, without data generalization. The outcome is visually represented and can be compared to the outcome from Validation Test \(t_4()\) and Validation Test \(t_5()\).

5.1 The purpose of validation tests

The objective of the validation tests is to ensure that the quasi-identifiers in a dataset can be protected, for instance by enabling data generalization. Algorithm 2 represents the validation function designed to execute a series of validation tests on the anonymized data; each test \(t_i()\) within the test set T is executed by this routine. These tests are designed to evaluate whether the algorithm’s mechanisms work as intended. Furthermore, through the validation tests, the learning scenarios are enhanced with sample executions of the algorithm that support the understanding of the strengths and limitations of k-anonymity. By executing k-anonymity and the validation tests, the anonymization process transforms the input dataset using data suppression and data generalization and provides an evaluation metric indicating the extent to which the dataset has been anonymized. This percentage has two main purposes: to indicate the level of privacy protection achieved [43] and to support decision-making on the trade-off between privacy and data utility [44, 45].

Table 2 Sample from the anonymized dataset as a result of the execution of Validation Test \(t_1()\) - Algorithm 2

The seven validation tests, further explained in the sections below, are organized into a set denoted as T, as presented in Algorithm 2. The generic validation test algorithm provides an initial understanding of the arguments, including the dataset D, the \(k-value\), the quasi-identifiers QuasiID, and the generalization intervals G. The set consists of individual tests (\(t_i()\)), each assessing specific aspects of the algorithm’s execution. The tests within T are designed to cover a spectrum of scenarios, ensuring comprehensive validation and execution of the algorithm. Therefore, each test involves a variant of the execution of k-anonymity on the validation data.
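A minimal sketch of such a harness could look as follows; the test functions t1 to t7 are assumed to wrap calls to the anonymization routine with their specific configurations, so the names and arguments here are illustrative.

```python
def run_validation_tests(tests, dataset, k, quasi_ids, generalization):
    """Execute each validation test t_i() on the same input arguments."""
    for i, test in enumerate(tests, start=1):
        print(f"Running Validation Test t{i}() ...")
        test(dataset, k, quasi_ids, generalization)

# Hypothetical usage, assuming t1 .. t7 are defined elsewhere:
# run_validation_tests([t1, t2, t3, t4, t5, t6, t7],
#                      "heart.csv", 3, ["Age", "Cholesterol"], {"Age": 10})
```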

Algorithm 2: Overall validation tests and a first validation test applying data suppression

The threshold of k-anonymity can be set by defining the \(k-value\) and the selection of quasi-identifiers, as presented in a sample execution of the algorithm in Listing 1.

Listing 1: Sample execution of the k_anonymize function (see the repository [9])

The execution results in a dataset in which each group contains at least k records with the same quasi-identifier values, ensuring a reasonable level of privacy protection. An overall sample of the generated dataset, after executing the algorithm with all the capabilities that are further explained in the validation tests, is presented in Table 2. There are three arguments for executing the k-anonymity, as presented in Listing 1:

  • The first argument of the function k_anonymize is the file path to the CSV dataset file.

  • The second argument is the minimum number of records in each group of equivalence classes (\(k-value\)) to satisfy the k-anonymity property.

  • The third and final argument is the list of column names of the quasi-identifiers in the dataset that should be protected.

The main function reads the dataset file at the specified path, performs generalization on the quasi-identifiers if necessary, and then groups the data into equivalence classes based on the values of the quasi-identifiers. The function then replaces the quasi-identifier values of each record in an equivalence class with asterisks if the size of the class is less than k. Finally, the function calculates the percentage of the dataset that was anonymized and writes the resulting anonymized dataset to a new CSV file (a sample is presented in Table 2).

In Validation Test \(t_1()\), the \(k-value\) is set to \(k=3\), meaning that each group in the dataset must contain at least three individuals with identical quasi-identifier values. As an example, the quasi-identifiers in this dataset were set to the attributes ’Age’ and ’Cholesterol’. Each row in the dataset represents an individual, and the columns represent different attributes of that individual. The first column has been anonymized, represented by asterisks (*), so that the identity of the individual cannot be determined.
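Since the exact call in Listing 1 is available in the repository [9], the following is only an equivalent invocation based on the sketch from Sect. 4, reproducing the configuration of Validation Test \(t_1()\) (suppression only, \(k=3\), ’Age’ and ’Cholesterol’ as quasi-identifiers):

```python
# k = 3, quasi-identifiers 'Age' and 'Cholesterol', no generalization intervals.
percentage = k_anonymize("heart.csv", 3, ["Age", "Cholesterol"])
print(f"{percentage:.4f}% of the rows were suppressed")
```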

5.2 Validation tests t2() and t3(): data suppression and data generalization intervals

In this validation test, the parameter ‘generalization_intervals‘ is set to a dictionary, e.g., \(\text {{"Age": 10}}\). This dictionary maps attributes to the interval widths used during generalization. Specifically, it indicates that each value will be divided by 10, rounded down to the nearest integer, and multiplied back by 10. Thus, if \(A\) represents the original age value and \(A'\) represents the generalized age value, then \(A' = \left\lfloor \frac{A}{10} \right\rfloor \times 10\).

Suppose we have an original age value \(A\) of 37 years. To find the generalized age value, first calculate \(\frac{37}{10} = 3.7\). Next, apply the floor function (\(\left\lfloor \cdot \right\rfloor \)) to round down to the nearest integer, which is 3. Finally, multiply 3 by 10. Therefore, the generalized age value \(A'\) for an original age of 37 years is 30 years. This demonstrates how the formula generalizes age values to the nearest lower multiple of 10. Data suppression is enforced by replacing the quasi-identifiers with asterisks (*). In the validation tests, the generalization intervals are tested.
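In code, the same rule \(A' = \left\lfloor \frac{A}{w} \right\rfloor \times w\) for an interval width w is a one-liner; the function name is illustrative:

```python
def generalize(value, width):
    """Round a value down to the nearest multiple of the interval width."""
    return (value // width) * width

print(generalize(37, 10))  # 30: ages 30-39 fall into the same interval
print(generalize(42, 10))  # 40
```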

Table 3 Sample rows from the anonymized dataset from the execution of Validation Test \(t_2()\) - Algorithm 3
Algorithm 3: Validation tests \(t_2()\) and \(t_3()\) applying data suppression and data generalization intervals

In this Validation Test (Validation Test \(t_2()\) - Algorithm 3), the attribute "Age" is selected as a quasi-identifier and a generalization interval of 10 is selected. The generalization interval defines that the age values will be grouped into ranges of 10 in the anonymized dataset. For example, if an individual’s age is 42, it is generalized to the range 40–50. This level of generalization adds an extra layer of protection to the quasi-identifier.

A sample of the resulting dataset after the execution of this validation test is presented in Table 3. Validation Test \(t_3()\), whose execution is presented in Listing 2, regards the selection of a different configuration set with the desired anonymity level \(k=3\), while the quasi-identifiers are set to "Age", "Cholesterol", and "Fasting BS". The interval values for generalization are set to 80 and 20, respectively, indicating that values are generalized to the nearest lower multiples of these intervals.

A sample of the anonymized dataset resulting from this validation test is presented in Table 4. The output from the execution of Validation Test \(t_3()\) is presented in Listing 2, which, along with the anonymized dataset, also reports the percentage of rows to which data suppression and data generalization were applied.

Listing 2: Output from the execution of Validation Test \(t_3()\)
Table 4 Sample rows from the anonymized dataset as a result of the execution of Validation Test \(t_3()\) - Algorithm 3

More specifically, the output presented in Listing 2 shows that 1.7429% of the rows in the dataset have been anonymized to meet the desired anonymity level. The attribute "Age", selected as a potential quasi-identifier, is replaced with the range "40–60" using data generalization. The "Cholesterol" attribute has also been generalized, with its values converted using the selected generalization intervals. Additionally, data suppression applies the asterisks (*) according to the procedure.

5.3 Validation tests t4() to t7(): impact of data generalization and of the selection of quasi-identifiers

In this section, validation tests are used to provide more information regarding the percentage of rows where data suppression was applied within the dataset, for varying values of k (as seen in Algorithm 4). We assume that data suppression increases the information loss in comparison to data generalization. The tests aim to assess the impact of the \(k-value\), which represents the desired level of anonymization, on the dataset’s privacy and utility.

Algorithm 4: Impact of data generalization and quasi-identifier selection

The validation tests, as depicted in Algorithm 4, generate the graphical representation presented in Fig. 2. The x-axis corresponds to different \(k-values\) ranging from 0 to 100, while the y-axis represents the percentage of rows to which data suppression was applied. The outputs from the execution of Validation Test \(t_4()\) and Validation Test \(t_5()\) are presented in Fig. 2. The generated plot depicts how the selection of attributes as quasi-identifiers impacts the dataset anonymization.
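A sketch of how such a plot can be produced is given below, reusing the k_anonymize sketch from Sect. 4. The green color for the generalization-enabled curve follows the description of Fig. 2, while the file name, the column names, and the interval widths are assumptions.

```python
import matplotlib.pyplot as plt

ks = list(range(1, 101))
# t4(): suppression only, no generalization intervals supplied.
no_gen = [k_anonymize("heart.csv", k, ["Age", "Cholesterol", "FastingBS"]) for k in ks]
# t5(): generalization enabled on the quasi-identifiers (interval widths assumed).
with_gen = [k_anonymize("heart.csv", k, ["Age", "Cholesterol", "FastingBS"],
                        generalization={"Age": 10, "Cholesterol": 80}) for k in ks]

plt.plot(ks, no_gen, label="t4(): suppression only")
plt.plot(ks, with_gen, color="green", label="t5(): with generalization")
plt.xlabel("k-value")
plt.ylabel("% of rows with data suppression")
plt.legend()
plt.show()
```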

Fig. 2: Results from Validation Test \(t_4()\) without generalization and Validation Test \(t_5()\) with the generalization option enabled. When data generalization is enabled, information loss decreases because data suppression is applied less often

In Validation Test \(t_4()\) no generalization is applied, while in Validation Test \(t_5()\) generalization is applied to three quasi-identifiers; the two outcomes can be compared in the plot (Fig. 2). By examining the results from these validation tests, several critical aspects of data anonymization can be observed:

  1. The impact of k-value: The plot reveals how the percentage of rows where data suppression is applied changes as the anonymity threshold increases. This investigation highlights the trade-off between achieving a higher level of privacy (higher \(k-value\)) and preserving dataset utility.

  2. Role of generalization: The influence of generalization on the anonymization process is depicted by the execution of the two different validation tests. The green line depicts the validation test with the generalization option enabled. This demonstrates that by using data generalization, data suppression is applied to fewer rows, thus reducing the information loss.

  3. Data privacy vs information loss: The analysis contributes to decision-making, as it provides information on the scale at which data suppression was applied in the dataset. By changing the selection of attributes as quasi-identifiers, the plot shows the impact of each choice on the dataset.

Figure 3 presents the relationship between the anonymity threshold (\(k-value\), from 0 to 100) and the percentage of rows where data suppression was applied. Different combinations of quasi-identifiers can be tested in this validation test. For example, Validation Test \(t_6()\) provides the percentage of data suppression when only the "Age" attribute is considered a quasi-identifier, while in Validation Test \(t_7()\) both the "Age" and "Sex" attributes are considered quasi-identifiers. The x-axis represents the value of \(k\), and the y-axis represents the percentage of rows where data suppression is applied for this \(k-value\) and the selected attributes.

Fig. 3: Results from Validation Test \(t_6()\), selecting one attribute as quasi-identifier, and Validation Test \(t_7()\), selecting two or more attributes as quasi-identifiers. The selection of multiple quasi-identifiers increases the level of data suppression applied to the dataset

The results presented in Fig. 3 provide information regarding the impact of the different selections and combinations of quasi-identifiers. They show that the level of data suppression increases as the \(k-value\) gets higher, and how a change in the selection of quasi-identifiers affects the outcome.

6 K-anonymity unveiled: progressive learning scenarios for k-anonymity

The learning scenarios presented in this section include teaching activities on the theory and practice of k-anonymity that cover different aspects of the anonymization process [9]. The scenarios can be used for acquiring technical skills and as guidelines to expand on and/or prepare a series of courses on the topic.

6.1 Learning scenario 01: fundamental concepts in K-anonymity

This learning scenario is introductory and mainly covers the implementation details of k-anonymity. The scenario begins with the dataset description, to understand the data structure and become familiar with the attributes and potential quasi-identifiers. By examining the code snippets and running Validation Test \(t_1()\), the underlying logic of the main functionality of the k-anonymity algorithm is presented. Investigation of the \(k-value\) is also part of this introductory learning scenario.

6.2 Learning scenario 02: attribute selection and K-value selection

This scenario focuses on the selection of attributes as quasi-identifiers and the impact of the \(k-value\). The anonymized dataset is investigated and analyzed to comprehend the impact of attribute selection on the outcome of the anonymization. Through selective inclusion or exclusion of certain quasi-identifiers while running Validation Test \(t_6()\) and Validation Test \(t_7()\), one can observe how the attribute selection influences the achieved level of anonymity. Therefore, a practical understanding of the impact of selecting quasi-identifiers is achieved.

6.3 Learning scenario 03: generalization intervals and attribute selection

This scenario explores the impact and benefits of data generalization in comparison to data suppression. Generalization, as a technique, transforms the values of quasi-identifiers according to a threshold. Therefore, the data maintain their statistical properties while the values of the quasi-identifiers change only slightly.

The scenario starts with the application of k-anonymity, enabling data generalization on a single quasi-identifier. A generalization interval is specified as a threshold, as presented in Validation Test \(t_2()\) and Validation Test \(t_3()\), and by changing the \(k-values\) the anonymized dataset is observed. The balance between privacy and information loss is investigated, and the observed results provide familiarity with the benefits of data generalization. Subsequently, the applicability of k-anonymity with data generalization is tested on two or more attributes. This exploration allows for an understanding of the interplay between data generalization and \(k-values\).

6.4 Learning scenario 04: visualization as an overview of the anonymization process

The focus of this learning scenario is on the visualization aspects of k-anonymity. Visual representation of the data allows a comprehensive overview of k-anonymity while considering various attributes and the application of data generalization.

Multiple plots are suggested to be generated by running Validation Tests \(t_4()\)–\(t_7()\) with various configuration options, in order to perform a thorough comparison between them. By altering the selection of attributes and the applicability of data generalization, the plots can provide interesting information. By observing the behavior of k-anonymity, the visualization depicts the trade-off between anonymity and information loss, as seen in the previous learning scenarios. This supports the decision on the optimal \(k-value\) in order to keep the balance between privacy preservation and information loss.

6.5 Learning scenario 05: privacy needs and privacy requirements

The procedure of defining privacy requirements can affect decision-making on privacy preservation. The focus of this scenario is to provide guidance for defining privacy requirements according to the specific privacy regulations that govern the dataset. This scenario consists mostly of discussion of the results and brainstorming on how privacy regulations or specific standards can affect the decision on the \(k-value\) or the selection of quasi-identifiers.

Through iterations and testing of different configurations, an optimal fit that aligns with the privacy needs can be achieved using the iterative procedure presented in the previous learning scenarios. This iterative process facilitates the acquisition of a comprehensive understanding of the importance of defining privacy requirements and making informed decisions about data utility constraints when implementing k-anonymity.

7 Conclusions

This work thoroughly analyzes the practical application of k-anonymity in the healthcare domain. To this end, a comprehensive implementation of k-anonymity was presented, specifically designed for educational purposes. The implementation, along with the validation tests that demonstrate k-anonymity in practice, can help individuals become familiar with privacy preservation techniques. More specifically, the analysis focuses on measuring the achieved level of privacy, assessing the impact of data suppression and generalization on information loss, and validating the efficacy of data utilization after applying k-anonymity. Finally, this research provides five learning scenarios [9] that can be used to acquire technical skills and as guidelines to expand on or prepare a series of courses on the topic.

The research has certain limitations, as it focuses on k-anonymity, although there are many other techniques for privacy preservation, such as differential privacy. Furthermore, the study relies on a specific dataset in the healthcare domain. It should also be noted that the developed k-anonymity algorithm was not optimized for performance or high reliability but serves as a simplified model suitable for the educational objectives of the research. Finally, the validation tests can be further extended, along with visualization options, to better explain the effects of anonymization. As future work, the training material can be enriched with emerging privacy preservation techniques, such as differential privacy. Additionally, broadening the scope of the learning scenarios to encompass datasets from diverse fields such as finance, social media, or e-commerce would offer valuable insights and a more comprehensive understanding of the overall learning process.