1 Introduction

Rounding is based on the operating principle of data discretization or quantization, which maps a given continuous variable into a discrete set of values. This is achieved by mapping the values of a given set \(X=\{x_1,\ldots ,x_n\}\) onto a much smaller set \(R=\{r_1,\ldots ,r_k\}\), where \(k<n\) and each \(r_i\) is less specific than the \(x_i\) values. Here, each \(r_i\) is a rounding point, and the collection R of rounding points defines a rounding set. Rounding values are defined so that minimum protection, i.e., the minimum criteria required for data protection, is guaranteed. Privacy is achieved by replacing multiple \(x_i\) values with a single \(r_i\). The biggest challenge in rounding is to generate a rounding set R that reduces the expected distortion while avoiding the risk of “disclosure”. A disclosure occurs when an adversary exploits a released dataset in order to obtain information about an individual that is otherwise not known to them. When each data record contains unique attribute values, it is straightforward for an adversary with relevant background information to attempt a disclosure, either by linking a given data record to a specific individual (identity disclosure) or by learning the value of a sensitive attribute belonging to an individual (attribute disclosure).

Obtaining the rounding points can be explained with respect to scalar quantization (SQ). A given attribute/vector is partitioned into homogeneous groups, and a representative value is chosen for each partition as a rounding point. In information theory terminology a rounding point can be identified as a code word and the rounding set as the code book. In quantization, the objective is to generate a code book in a way that minimizes the distortion introduced by encoding. As explained earlier, this can be used as a data protection technique to minimize the disclosure risk.

As explained by Willenborg and De Waal (2012), the conventional way of defining the rounding points is to convert each \(x_i\) into a multiple of a given base value b as \(\lfloor x_i/b \rfloor * b\). For each rounding point \(r_i\), the set of original data points (\(x_i\)) that can be mapped to it is known as its set of attraction. This is a half-open interval indicated as \([r_i - \frac{b}{2}, r_i + \frac{b}{2})\). All the values in the original dataset that fall within this interval are represented by \(r_i\). It is necessary to decide on a suitable b value such that each set of attraction is sufficiently large (contains at least a minimum number of elements) in order to avoid disclosure. This can be expressed as choosing the smallest b such that \(min~\{ F^x(0,r_1 + \frac{b}{2}), \ldots ,F^x(r_n - \frac{b}{2} , x_{max}] \} \ge \alpha\) is achieved, where \(F^x\) indicates the number of data points that fall within a given interval and \(\alpha\) indicates the minimum size of the set of attraction. However, it is difficult to choose a base value b such that the distance \(|r_i - x_i|\) is minimized while the required minimum protection is achieved.
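
As a minimal illustration of this conventional scheme, the sketch below (in R, the language used for our experiments) rounds a variable to multiples of a base b; it uses the nearest multiple, which is consistent with the half-open sets of attraction above, and the helper name is ours.

    # A minimal sketch of conventional base-b rounding: each value is replaced
    # by the nearest multiple of the base b, so its set of attraction is
    # [r - b/2, r + b/2). The helper name is illustrative.
    round_to_base <- function(x, b) {
      round(x / b) * b
    }

    # Example: with b = 10, the values 12, 17 and 24 map to 10, 20 and 20,
    # so several distinct x values share a single rounding point.
    round_to_base(c(12, 17, 24), b = 10)   # 10 20 20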

In this work, we explore a few alternative approaches to generate the rounding sets. Here, we focus on the univariate case where the numerical attributes are masked one at a time. The SDC literature discusses masking methods with respect to both univariate and multivariate cases, as they have different benefits. In the univariate case, the overall distortion introduced by masking can be minimized as each variable is considered individually. However, this intrinsically carries a higher risk of disclosure compared to the multivariate case. Even though multivariate masking can preserve the statistical relationships between the variables, it still introduces a high distortion. Moreover, grouping the variables for masking is not very straightforward. To generate the rounding sets, we consider the following two families of methods.

  • Methods based on data discretization Data points are partitioned into k non-overlapping intervals, and for each interval a rounding point is chosen based on a given aggregation method, e.g., the mean, median, re-sampling mean or cluster centroid.

    Here, we consider methods such as equal width discretization (EWD), equal frequency discretization (EFD), re-sampling based discretization (RBD) and K-means clustering based discretization (KMD). EWD and EFD methods are general data pre-processing techniques often used in data analysis and machine learning domains. RBD is a new approach that we introduce in this paper.

  • Methods based on microaggregation In this case, micro-clusters are generated over a given set of data points, ensuring a minimum of k items in each cluster. Then the cluster centroids are selected as rounding points.

    Three univariate microaggregation methods are considered in this work: Maximum Distance to Average Vector (MDAV), Optimal Microaggregation (OMA) (Hansen and Mukherjee 2003), and a univariate implementation of the Variable-size Maximum Distance to Average Vector (V-MDAV) method, which was initially introduced for multivariate microaggregation.

Once the set of rounding values is identified, a given continuous variable can be quantized by replacing each \(x_i\) using either a deterministic or a stochastic approach. The methods mentioned above are explained in Sect. 3.

In the literature, rounding is described as a statistical disclosure control method, but we did not come across any experimental evaluation of the method. In this paper, we work to fill this gap. This provides a clear understanding of rounding as a data masking tool and integrates rounding with discretization methods and microaggregation in a unified framework. We employ unsupervised and supervised discretization methods that have not been discussed before in the SDC literature in order to obtain the rounding points. Then we compare the results of the different discretization methods discussed above in terms of their information loss (IL) and disclosure risk (DR).

This paper is structured as follows. Related work is presented in Sect. 2, followed by a detailed discussion of the different rounding methods in Sect. 3. In Sect. 4, the experimental setup and results are discussed. The impact of data rounding on selected machine learning algorithms is examined in Sect. 5, followed by a discussion in Sect. 6 and conclusions in Sect. 7.

2 Related work

The SDC literature contains a plethora of micro-data protection methods, also referred to as masking methods (Domingo-Ferrer 2008; Willenborg and De Waal 2012). Based on their operational principles, masking methods can be grouped into three categories: perturbative, non-perturbative and synthetic data generation (Torra 2017). Rounding is a perturbative data masking method.

In this work, we are adopting data discretization methods to generate the rounding set. Data discretization is used widely in machine learning (ML) and knowledge discovery (Chmielewski and Grzymala-Busse 1996; Ibrahim and Hacibeyoğlu 2016; García et al. 2013; Dougherty et al. 1995; Ramírez-Gallego et al. 2016). Benefits of discretizing data as a pre-processing step include improvements in induction time, smaller sizes of induced trees/rules, enhanced predictive accuracy, and the fact that most supervised learning algorithms require a discrete feature space (Pfahringer 1995; Yang and Webb 2002).

The operating principle of discretization can be used for numerical data masking in order to provide a privacy guarantee. Discretized data minimize the risk of disclosure at the cost of information loss, i.e., a decrease in the analytical quality of the data. Therefore, when designing a discretization method, minimizing the information loss becomes paramount. In data discretization, original attribute values are mapped into more generic, less precise values. Microaggregation (MA) is a clustering based SDC technique that is designed for continuous data masking. It is considered as a quantization mechanism in Ramírez-Gallego et al. (2013) and Willenborg and De Waal (2012).

The Lloyd (1982) and Max (1960) algorithms are the earliest and foremost attempts at creating optimal quantizers, and they are similar to k-means clustering. Here, the underlying principle of clustering is used for quantizer design, which is the same notion as discretization. The work presented by Rebollo-Monedero et al. (2013) introduced a modified version of the Lloyd-Max algorithm that shows how the concept of quantization can be used to achieve privacy by creating k-anonymous quantizers. Another algorithm for k-anonymous microaggregation is introduced by Rebollo-Monedero et al. (2011), which is based on the concept of distortion-optimized quantizers. The notion of k-anonymity is introduced by Sweeney (2002). A dataset is said to satisfy k-anonymity for \(k > 1\) if each data record has at least \(k-1\) other records sharing the same values for the quasi-identifiers. Data generalisation and local suppression are used to achieve k-anonymity while minimizing information loss. Having at least k records sharing the same values for the quasi-identifiers eliminates the risk of identity disclosure.

Data anonymization is studied as a vector quantization problem with respect to health data for minimizing individual or group re-identification from a released dataset (Miché et al. 2016). The paper proposes to use properties of vector quantization to anonymize a given dataset. Zhang (2011) discusses how fuzzy discretization can be used for protecting sensitive attribute values. Zhu et al. (2009) discuss how data discretization can be used for privacy preserving time series mining, and their results indicate that data discretization causes a slight reduction of classification accuracy. However, they have not discussed the problem with respect to information loss or disclosure risk evaluation, or comparatively analysed the different discretization techniques available. In our work, we aim to fill this gap with respect to continuous data.

3 Rounding

Rounding on micro data replaces the original continuous variable values (\(x_i\)) with selected rounding points (\(r_i\)), so that the distance \(|x_i-r_i|\) is minimized. The set of all rounding points obtained for an attribute is known as the rounding set. Rounding can be either univariate or multivariate, and deterministic or stochastic. In this work, we focus on deterministic-univariate rounding.

Rounding comprises three sub-processes: (a) partitioning the dataset into quantization regions, (b) constructing the rounding set, and (c) encoding the original values with the nearest rounding points. Here, deriving the rounding set is the most critical step. Figure 1 depicts the process of obtaining the rounding set.

Fig. 1 Obtaining quantization regions and rounding points for a given variable x. Each quantization region is an interval and is represented by a rounding point \(r_i \in {\mathbb {R}}\). A quantization region can be defined as \(Q_i =[a_{i-1}, a_i)\)

If rounding is considered as a process of quantization, there are two optimality conditions to be met (Lloyd 1982).

  • nearest neighbour condition: each \(x_i\) is mapped to the quantization point that minimizes the distance, i.e., \(q(x_i) = \arg \min _{r_j \in R} d(x_i, r_j)\), where \(q(\cdot )\) is an optimal quantizer.

  • centroid condition: each quantization point (rounding point) is the centroid of the quantization interval.

In the following subsections, we explore a few techniques that can be used to generate the rounding set, keeping in mind the two optimality conditions of quantizer design.
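
Once a rounding set is available, the encoding step follows directly from the nearest neighbour condition. The sketch below (in R, with an illustrative function name) shows this step for an arbitrary rounding set; the methods in the following subsections differ only in how that set is constructed.

    # A minimal sketch of the encoding step: each x_i is replaced by its
    # nearest rounding point, which is exactly the nearest neighbour
    # condition above. The helper name is illustrative.
    encode_nearest <- function(x, rounding_set) {
      r <- sort(rounding_set)
      idx <- apply(abs(outer(x, r, "-")), 1, which.min)   # nearest point per x_i
      r[idx]
    }

    encode_nearest(c(1.2, 3.9, 7.5), rounding_set = c(1, 4, 8))   # 1 4 8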

3.1 Microaggregation for rounding

SDC methods are used for protecting micro-data so that the protected data can be released without a risk of disclosure. Here, we discuss microaggregation for deriving the rounding points.

This is a numerical data masking technique where the original values are divided into micro-clusters and then replaced by the cluster representatives. The parameter k defines the minimum number of data points required to form a micro-cluster. As the cluster centroid is used to replace the original values that fall into a particular cluster, the uniqueness of data records is concealed, thus preserving the privacy of the released data. The basic idea is to generate homogeneous clusters over the original data in such a way that the distance between clusters is maximized, so that the information loss can be minimized. Clusters are formed by minimizing the within-group sum of squared errors (SSE).

Three variants of microaggregation (MA) are used to generate the rounding set as below.

  • MA based on MDAV (maximum distance to average vector) algorithm with “fixed” micro-cluster sizes.

  • MA based on the V-MDAV (variable-size maximum distance to average vector) algorithm (Solanas et al. 2006) with “variable” micro-cluster sizes. It was initially introduced for multivariate microaggregation. The micro-cluster sizes lie between k and \(2k-1\). The algorithm is explained in Algorithm 2.

  • Optimal MA (OMA) (Hansen and Mukherjee 2003), based on the shortest path principle in graphs, with “variable” micro-cluster sizes lying between k and \(2k-1\). It derives the optimal solution to the k-partition problem with minimal distortion.

Here, the micro-cluster centroids are considered as rounding points (\(r_i\)), and each of these points has at least k points of attraction from the original dataset.
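
For concreteness, the sketch below outlines a simplified, fixed-size univariate MDAV-style procedure in R; it is illustrative only, and the implementations used in the experiments may differ, particularly in how the final group is handled.

    # A simplified sketch of fixed-size, univariate MDAV-style microaggregation.
    # Cluster means become the rounding points that replace the original values.
    mdav_round <- function(x, k) {
      out <- numeric(length(x))
      remaining <- seq_along(x)
      while (length(remaining) >= 3 * k) {
        centroid <- mean(x[remaining])
        # most distant point from the centroid, then the point most distant from it
        far1 <- remaining[which.max(abs(x[remaining] - centroid))]
        grp1 <- remaining[order(abs(x[remaining] - x[far1]))][1:k]
        remaining <- setdiff(remaining, grp1)
        far2 <- remaining[which.max(abs(x[remaining] - x[far1]))]
        grp2 <- remaining[order(abs(x[remaining] - x[far2]))][1:k]
        remaining <- setdiff(remaining, grp2)
        out[grp1] <- mean(x[grp1])
        out[grp2] <- mean(x[grp2])
      }
      out[remaining] <- mean(x[remaining])   # leftover k..3k-1 points form one group
      out
    }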

There exists a wide variety of MA algorithms to generate the anonymized data other than the ones mentioned above (Chettri et al. 2012; Abidi et al. 2018).

3.2 Discretization for rounding

The process of discretization is used to map continuous data with a high cardinality into a finite set of values. Discretization of a given continuous variable includes sorting the data values, partitioning them into non-overlapping intervals based on a given condition, and selecting suitable values to represent the data in each interval. Discretization methods can be categorised into two groups, supervised and unsupervised, based on whether class information is utilized in the data partitioning process. In this work, we employ the unsupervised methods discussed below for obtaining the rounding set, as class information is not always available for a dataset. More specifically, they are used to partition the data.

3.2.1 Equal width discretization (EWD)

The range of a sorted variable is divided into c non-overlapping, equally sized partitions. The intervals are equidistant and c is a user defined parameter. The width of the intervals is obtained as \(interval~width = (max_{x} - min_{x})/c\).

EWD is known to perform well on uniformly distributed data. However, with skewed distributions it generates unbalanced intervals. Therefore, with EWD the minimum protection cannot be ensured for the rounding points obtained for each interval.
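
A minimal EWD rounding sketch in R (with an illustrative helper name) could look as follows, replacing each value by the mean of its equal-width interval.

    # Sketch of EWD rounding: the range of x is split into c equal-width
    # intervals and every value is replaced by its interval mean, which acts
    # as the rounding point.
    ewd_round <- function(x, c) {
      breaks <- seq(min(x), max(x), length.out = c + 1)
      bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
      ave(x, bins, FUN = mean)
    }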

3.2.2 Equal frequency discretization (EFD)

The sorted values of a given variable are partitioned into c intervals such that each interval contains roughly \(\frac{n}{c}\) values.

The issue of unbalanced intervals raised by EWD can be resolved by adopting EFD. However, EFD is known to have other drawbacks, such as duplicate data points being assigned to different intervals, while very dissimilar values can be put together in order to form intervals with a given frequency (Bennasar et al. 2012). Also, when the former issue is resolved by grouping all the duplicate values within the same interval, it is not always possible to generate intervals with equal frequency (Jiang et al. 2009). Therefore, the flexibility of deciding the number of intervals (anonymity constraint c) is limited in this case.
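
A corresponding EFD rounding sketch in R (helper name ours) uses sample quantiles as interval boundaries; dropping duplicated boundaries reflects the limitation just described, since the realised number of intervals can then be smaller than c.

    # Sketch of EFD rounding: interval boundaries are sample quantiles, and
    # each value is replaced by the mean of its interval. Duplicated
    # boundaries are collapsed, so fewer than c intervals may result.
    efd_round <- function(x, c) {
      breaks <- unique(quantile(x, probs = seq(0, 1, length.out = c + 1)))
      bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
      ave(x, bins, FUN = mean)
    }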

3.2.3 K-means clustering (KMD)

Univariate k-means clustering is used on a given variable to create c clusters while minimizing the sum of distances between data points and cluster centroids. The cluster centroids are considered as rounding points, and the original data within each cluster are replaced by the centroid value.
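
In R this amounts to a call to the built-in kmeans function; the sketch below (helper name ours) replaces each value with the centroid of its cluster.

    # Sketch of KMD rounding: univariate k-means with c clusters; each value
    # is replaced by the centroid of the cluster it belongs to.
    kmd_round <- function(x, c) {
      km <- kmeans(x, centers = c, nstart = 10)
      as.numeric(km$centers[km$cluster, 1])
    }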

3.2.4 Re-sampling based discretization (RBD)

Here, we explore how re-sampling can be used for discretization. We first derive a re-sampled dataset from the given original dataset and then apply k-means clustering on the re-sampled dataset for discretization. The sorted original dataset (\(sorted_{df}\)) is partitioned into x quantiles (in this work we use 10 quantiles). Then, for each quantile \(Q_i\), m bootstrap samples (\(bs_{1\ldots m}\)) are extracted, where each \(bs_i\) is of size m and m is the size of the relevant quantile, that is \(m = |Q_i|\). For each bootstrap sample \(bs_i\), its centroid (e.g., mean, median) is calculated. By repeating this process \(x*m\) times, the re-sampled dataset is generated, which is then discretized using k-means clustering. We also tested the discretization step with other unsupervised discretization methods such as EWD and EFD; based on the results, k-means clustering outperforms them, and thus we use it for discretization here. The algorithm is explained in Algorithm 1.

Algorithm 1 Re-sampling based discretization (RBD)
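
The sketch below summarises the RBD idea in R under the description above (the helper name and the choice of the mean as the bootstrap centroid are illustrative): bootstrap means are computed within each quantile group, the re-sampled values are clustered with k-means, and the resulting centroids encode the original data.

    # Sketch of re-sampling based discretization (RBD).
    rbd_round <- function(x, c, n_quantiles = 10) {
      xs <- sort(x)
      q <- cut(xs, breaks = quantile(xs, probs = seq(0, 1, length.out = n_quantiles + 1)),
               include.lowest = TRUE, labels = FALSE)
      resampled <- unlist(lapply(split(xs, q), function(qi) {
        m <- length(qi)
        replicate(m, mean(sample(qi, m, replace = TRUE)))   # m bootstrap means
      }))
      km <- kmeans(resampled, centers = c, nstart = 10)
      centers <- as.numeric(km$centers)
      # encode the original values with the nearest centroid
      centers[apply(abs(outer(x, centers, "-")), 1, which.min)]
    }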

3.3 Determining the quality of rounding methods

The results of each discretization method are evaluated based on three criteria: (a) information loss, (b) disclosure risk and (c) the accuracy of machine learning models built on the rounded data.

3.3.1 Information loss

In simple terms, information loss measures quantify how much the masked data deviate from the original version, based on statistical properties and the distance between data points. In this case, we use two methods to measure information loss (IL).

The information loss metrics (ILMetrics) measure was introduced by Domingo-Ferrer and Torra (2001). In this case, the IL is calculated by averaging the mean variance of \(X-X', {\bar{X}}-{\bar{X}}', V-V', S-S'\) and the mean absolute error of \(R-R'\), and finally multiplying the result by 100. The symbols are as follows: X is the original data file and \(X'\) the masked data file; V and R are the covariance and correlation matrices of X; S denotes the diagonal of V; and \({\bar{X}}\) denotes the variable averages for X. The corresponding primed symbols indicate the same properties of the masked data file.

Another IL measure is IL1s, which computes the standardised distances between the masked data and the original data, scaled by the standard deviation. That is \(IL1s = \frac{1}{mn} \sum _{j=1}^{m} \sum _{i=1}^{n} \frac{|x_{ij}-y_{ij}|}{\sqrt{2}S_j}\).
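
A direct R sketch of this measure, assuming x (original) and y (masked) are numeric matrices or data frames of the same dimension, is shown below.

    # Sketch of the IL1s measure: standardised absolute distances between the
    # original data x and the masked data y, averaged over all m*n cells.
    il1s <- function(x, y) {
      x <- as.matrix(x); y <- as.matrix(y)
      s <- apply(x, 2, sd)                          # per-variable standard deviation S_j
      mean(sweep(abs(x - y), 2, sqrt(2) * s, "/"))  # average standardised distance
    }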

3.3.2 Disclosure risk

The main purpose of applying any SDC method is to minimize the risk of disclosure. Especially when personal data are handled, we want to make sure that the application of SDC methods mitigates identity and attribute disclosures, so that an adversary with a substantial amount of auxiliary information would not be able to identify the data records belonging to a given data subject with certainty. In this case, two methods are employed to quantify the disclosure risk: (a) distance based record linkage (DDR) and (b) interval based disclosure (IDR).

In DDR, for each record in the masked data file, the distance to every record in the original data file is calculated. Then, the original data records that report the shortest distance to a given masked data record are considered as candidates for the linking process. A correct match is counted if the nearest record in the original data file is, in fact, the corresponding original record.
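
A minimal R sketch of this linkage experiment is given below; it assumes that row i of the masked file was derived from row i of the original file and reports the proportion of correct matches.

    # Sketch of distance based record linkage (DDR): the risk is the share of
    # masked records whose nearest original record is their true source record.
    ddr <- function(original, masked) {
      original <- as.matrix(original); masked <- as.matrix(masked)
      hits <- vapply(seq_len(nrow(masked)), function(i) {
        d <- rowSums(sweep(original, 2, masked[i, ], "-")^2)   # squared distances
        which.min(d) == i
      }, logical(1))
      mean(hits)
    }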

IDR defines an interval around the masked data (based on the standard deviation) and checks whether any original values fall within this interval (Templ 2017). The risk is measured as the number of times this check is positive. The underlying concept is, again, the distance between the masked data and the original data; in order to distinguish it from DDR we refer to it as IDR in this work.
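
The sketch below illustrates the idea; the interval width parameter p (as a fraction of the per-variable standard deviation) is our own assumption and is not prescribed by the measure as described above.

    # Sketch of interval based disclosure (IDR): an interval of +/- p standard
    # deviations is placed around each masked value, and the risk is the share
    # of original values that fall inside their interval. p is an assumed parameter.
    idr <- function(original, masked, p = 0.1) {
      original <- as.matrix(original); masked <- as.matrix(masked)
      s <- apply(original, 2, sd)
      half_width <- matrix(p * s, nrow(masked), ncol(masked), byrow = TRUE)
      mean(abs(original - masked) <= half_width)
    }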

Algorithm 2 The V-MDAV algorithm

3.3.3 IL-DR score

As discussed above, IL is measured based on ILMetrics and IL1s, while disclosure risk is calculated based on IDR and DDR. The values obtained for each masked/discretized dataset are then used to derive a score as explained below (Domingo-Ferrer and Torra 2001).

$$\begin{aligned} {{\text {IL-DR}}}_{\text {score}} = 0.25*IL~Metrics + 0.25*IL1s + 0.25*DDR + 0.25*IDR \end{aligned}$$

The IL-DR score gives equal weight to the different IL (information loss) and DR (disclosure risk) measures we have obtained. Therefore, it is used to understand the trade-off between privacy (DR) and utility (IL). The lower the score, the better the particular discretization method is.

3.3.4 Accuracy of ML models

This can also be considered as another way of measuring information loss. Here, we explore how rounding impacts the predictive accuracy of machine learning (ML) models. This is studied with respect to linear regression and decision trees. The prediction accuracy of the ML models trained on masked data is determined based on how accurately the models can predict the original data. A low prediction accuracy on the original data indicates that the masked data used to train the models have a poor utility. For linear regression models, the mean squared error (MSE) and the \(R^2\) score are used to evaluate the utility of the models built on masked data, along with the information loss measures discussed above. With respect to decision trees, mainly classification accuracy and entropy are used for the analysis. Decision tree classifiers are built using the RPART package available in R.

4 Methodology and experimental results

As explained previously, we evaluate different techniques that can be used to obtain the rounding set, based on microaggregation or unsupervised discretization. These methods are then evaluated with respect to distortion (information loss) and anonymity (disclosure risk), while changing other properties such as the data distribution and dataset size. Finally, the impact of data masking on ML based data modelling is evaluated using linear regression and decision tree algorithms.

4.1 Data

For the experiments two types of datasets are used: (a) synthetic datasets following theoretical distributions, i.e., exponential, uniform and normal distributions, and (b) the openly available Tarragona, Boston housing and Wine classification datasets. All synthetic datasets are \(2 \times 500\) in dimension and generated using the rexp, runif and rnorm functions available in R to produce exponential, uniform and normal distributions, respectively. The parameters for generating the synthetic datasets are as follows: the exponential distribution is generated with \(\lambda = 0.08\), the normal distribution with (\(\mu =0\), \(\sigma =1\)) and the uniform distribution with minimum and maximum values in the range (0,1). The dimensions of the other datasets are: Tarragona \(834 \times 13\), Boston housing \(506 \times 14\) and Wine classification \(179 \times 14\).

4.2 Results

In this section, we report the results of the experiments. First, we compare the information loss (IL) and disclosure risk (DR) values obtained for the different synthetic datasets. Comparisons are made for microaggregation based methods and unsupervised discretization methods. MDAV, V-MDAV and OMA are compared together as they are microaggregation based discretization methods, whereas the unsupervised discretization methods EWD, EFD, KMD and RBD are compared together. Here, each of the above mentioned methods is used to obtain the rounding set. This can also be explained as partitioning a dataset based on specified criteria; then, for each partition, a representative value is selected and the original data are replaced by it. IL is the distortion caused by this encoding process, while DR indicates whether we can directly identify a given masked data record (after rounding is applied) with respect to its original record with certainty. This can also be explained as the number of successful record linkages between the original dataset and the rounded dataset.

4.3 Setting anonymity constraints

We have selected the anonymity constraint values (k and c, respectively, for microaggregation and unsupervised discretization) such that approximately the same number of data points is used to form the micro-clusters in microaggregation and the intervals/clusters in unsupervised discretization when obtaining the rounding points. The k and c values are set in this way so that the overall results can be compared at the end. The relationship between k and c can be expressed as below.

$$\begin{aligned} k\approx \frac{\#~of~instances~in~attribute_{j}}{c}, \qquad c\approx \frac{\#~of~instances~in~attribute_{j}}{k} \end{aligned}$$

The selected values for parameter k in microaggregation are { 167, 100, 50, 34, 25, 20 }. These correspond approximately to the selected values of parameter c for unsupervised discretization, which are respectively { 3, 5, 10, 15, 20, 25 }. In the synthetic datasets, each attribute has 500 instances, so, for example, \(c=3\) corresponds to \(k \approx 500/3 \approx 167\).

Table 1 Comparison of IL and DR measures obtained for the exponentially distributed synthetic dataset with varying anonymity constraints (k)
Table 2 Comparison of IL and DR measures obtained for the uniformly distributed synthetic dataset with varying anonymity constraints (k)
Table 3 Comparison of IL and DR measures obtained for the normally distributed synthetic dataset with varying anonymity constraints (k)

4.4 Microaggregation for rounding

In the case of microaggregation, the number of data points per micro-cluster (k) is directly proportional to IL and inversely proportional to DR. The higher the number of values in a micro-cluster, the lower the quality of the selected rounding point, resulting in a high IL. The behaviour is reversed when fewer values are used to form the micro-clusters.

Here, the three MA based methods are compared with respect to the different synthetic data distributions, and the results are shown in Tables 1, 2 and 3. As explained in Sect. 3.3.3, the lower the IL-DR score the better the specific method is. With respect to exponentially distributed data, VMDAV outperforms the other two microaggregation methods when evaluated based on the IL-DR score (see the IL-DR Score column in the above mentioned tables). When uniformly and normally distributed data are considered, OMA performs better than the other two MA methods. However, with respect to uniformly distributed data, when the k value is low (i.e., k = 34, 25, 20) VMDAV performs better than OMA, while with high k values OMA performs better. With respect to normally distributed data, when the k value is low, OMA and MDAV perform equally well. As we can see, despite being the “optimal” method for microaggregation, OMA does not always produce the lowest IL-DR scores. Compared to the other methods, in most cases OMA results in high disclosure risk (DR) ratios caused by the high utility of the rounded data. Eventually, this results in high IL-DR scores, which indicate a high privacy-utility trade-off.

The same experiments are run on the Tarragona dataset to measure IL and DR, as shown in Fig. 2. In this experiment, IL is also measured in terms of the sum of squared errors (SSE) and the total sum of squares (SST). SSE indicates the within-group homogeneity of the micro-clusters. SST is the sum of the between-groups sum of squares (SSB) and SSE. Here the IL is calculated based on the following formula: \(IL = \frac{SSE}{SST}*100\). The formulas are explained in detail in Chettri et al. (2012).
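
For a single variable and a given cluster assignment, this measure can be computed as in the short R sketch below (argument names are illustrative).

    # Sketch of the SSE/SST based IL measure for one variable, given the
    # cluster (or interval) assignment produced by a rounding method.
    il_sse <- function(x, cluster) {
      sse <- sum((x - ave(x, cluster, FUN = mean))^2)   # within-group sum of squares
      sst <- sum((x - mean(x))^2)                       # total sum of squares
      100 * sse / sst
    }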

As shown in Fig. 2a, b, both OMA and VMDAV result in a low IL compared to MDAV. OMA reports the highest DR compared to the other two methods. When considering the IL and DR trade-off, VMDAV outperforms the other two methods. Most of the attributes in the Tarragona dataset are exponentially distributed. Based on the results obtained on the synthetic datasets, VMDAV is the most suitable approach for rounding when a dataset is exponentially distributed, and the test results confirm this finding.

Fig. 2 Microaggregation based discretization on Tarragona dataset

4.5 Unsupervised discretization for rounding

In the case of unsupervised discretization, the number of intervals/clusters (c) is inversely proportional to IL and directly proportional to DR. The higher the number of intervals, the higher the quality of the selected rounding points, resulting in a low IL and a high DR. When the number of intervals is small, a large number of data points fall into each interval, thus introducing a high data distortion once a rounding point is selected for discretization.

A noteworthy point is that, in the case of unsupervised discretization, a high value for the anonymity constraint c indicates low privacy, whereas in microaggregation based methods the behaviour is the opposite. The reason for this is the quality of the selected rounding point, which discretizes the values that fall into a particular interval/cluster or micro-cluster.

Here, four unsupervised discretization methods are evaluated with respect to rounding, namely re-sampling based discretization (RBD), equal width discretization (EWD), equal frequency discretization (EFD) and k-means based discretization (KMD). Tables 4, 5 and 6 contain the results of using unsupervised discretization methods to obtain the rounding set with respect to the different theoretical distributions. Generally, EWD outperforms the other methods with respect to exponentially and normally distributed data. However, we analyse these results more closely. When the number of intervals is small (i.e., 3, 5), RBD performs better than the other methods on exponentially and normally distributed data, as it reports the lowest average IL-DR score. For normally distributed data, when the number of intervals is 3 (\(c=3\)) the IL-DR scores are respectively 60.76, 84.27, 166.76 and 254.56 for RBD, EWD, KMD and EFD. Exponentially distributed data also show that RBD reports the lowest IL-DR score of 69.73 when the number of intervals is 3, compared to 71.31 for EFD, 104.18 for EWD and 278.75 for KMD. For both normally and exponentially distributed data, when the number of intervals (c) is high (i.e., 25), EWD results in a low IL-DR score compared to the other methods. When the data are uniformly distributed, EFD outperforms the other methods.

In conclusion, with a small number of intervals, RBD performs better than the other methods when the data are normally or exponentially distributed. With the aforementioned distribution types, if the data are partitioned into a high number of intervals, it is advisable to use EWD. However, if the data are uniformly distributed, EFD is preferred irrespective of the number of intervals.

Figure 3 depicts the IL and DR analysis on the Tarragona dataset when unsupervised discretization methods are used for obtaining the rounding set. A steady decrease in IL and a gradual increase in DR can be noted when the data are split into a higher number of intervals. As per the results, RBD performs poorly with regard to IL. This is indicated by the high SSE ratio and ILMetrics values shown in Fig. 3a, b. In this case, EWD outperforms the other methods in terms of IL and DR.

Table 4 Comparison of IL and DR measures obtained for the exponentially distributed synthetic dataset with varying anonymity constraints (c)
Table 5 Comparison of IL and DR measures obtained for the uniformly distributed synthetic dataset with varying anonymity constraints (c)
Table 6 Comparison of IL and DR measures obtained for the normally distributed synthetic dataset with varying anonymity constraints (c)
Fig. 3 Unsupervised discretization methods on Tarragona dataset

4.6 Comparative analysis of the rounding methods

Here, we comparatively evaluate microaggregation and unsupervised discretization methods for rounding, based on their mean IL-DR scores. In this case, an average IL-DR score is obtained over the differing k or c values. As discussed earlier, the selected k or c parameter values ensure that approximately the same number of data points is used to form each micro-cluster or interval/cluster; therefore, this comparison can be justified. Figure 4 depicts the results of applying the different rounding methods on the different synthetic datasets. It can be noted that, compared to normally and uniformly distributed data, exponentially distributed data incur a high privacy-utility trade-off. When considering microaggregation based methods, OMA is more suitable for normally and uniformly distributed data, while for exponentially distributed data VMDAV is preferable. Of the unsupervised discretization methods, EWD is more suitable for exponentially or normally distributed data, whereas EFD performs better when the data are uniformly distributed. Overall, the OMA, VMDAV and EWD methods are more suitable for obtaining the rounding set than the other methods under consideration.

Fig. 4 Mean IL-DR score for different rounding methods

5 Impact of rounding for modelling data

In this section, we explore the impact of rounding when data are modelled using machine learning algorithms. Two types of machine learning algorithms are used to build the models: (a) linear regression and (b) decision trees. The quality of the models is evaluated based on the classification accuracy, the \(R^2\) value and the mean squared error (MSE), depending on their relevance. We compare the impact of applying the different discretization methods based on the above mentioned evaluation criteria. The results of the different discretization methods are measured with varying anonymity constraint values (k for microaggregation and c for unsupervised discretization).

For the unsupervised discretization methods, apart from the user defined c values, the number of bins is also decided based on the Freedman–Diaconis rule, which is widely used for deriving the interval width based on the following formula: \(width = 2*\frac{IQR(x)}{n^{1/3}}\). The same rule is used to decide the micro-cluster size k for each variable in the dataset as \(k=\frac{\#~of~data~points}{width}\). In this case, instead of processing each variable with the same anonymity constraint value (which is either the number of intervals (c) or the micro-cluster size (k)), it is determined per variable based on the Freedman–Diaconis rule.
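
The sketch below shows one way to derive per-variable constraints from this rule in R; for the cluster size it goes through the number of intervals, consistent with the \(k \approx n/c\) relationship of Sect. 4.3, and may differ slightly from the exact derivation used in the experiments.

    # Sketch: per-variable anonymity constraints from the Freedman-Diaconis rule.
    fd_constraints <- function(x) {
      n <- length(x)
      width <- 2 * IQR(x) / n^(1/3)                        # FD bin width
      c_fd <- max(1, ceiling((max(x) - min(x)) / width))   # number of intervals
      k_fd <- max(1, floor(n / c_fd))                      # approx. points per interval
      list(width = width, c = c_fd, k = k_fd)
    }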

5.1 Evaluating model accuracy

Microaggregation and unsupervised discretization methods are used to obtain the rounding set. For linear regression, the \(R^2\) and MSE values are used to evaluate the model utility. The \(R^2\) value indicates the goodness of fit: it explains how close the actual data points are to the fitted regression line, in other words, the variation of the response variable that can be explained by the model. Therefore, a higher \(R^2\) value illustrates a model with a good fit to the underlying data. However, in this case \(R^2\) cannot be used for comparative analysis as the values are measured on different independent training datasets obtained through different rounding methods. Instead, it is used to understand the relationship between the data and the respective models. In order to compare the model utility, MSE is used, calculated as \(MSE = \frac{1}{n} \sum _{i=1}^n (y_i-{\hat{y}}_i)^2\). Here, \(y_i\) indicates the original values of the response variable before rounding, and \({\hat{y}}_i\) indicates the predicted values. The lower the MSE, the better the model.

In the case of decision trees, classification accuracy is used for the comparison. Two types of accuracy figures are obtained: \(Acc_V\) is the validation accuracy of a given model on the rounded test data, and \(Acc_O\) is the classification accuracy of the model with respect to the original (un-rounded) data. \(Acc_O\) is the one used for comparison.

5.2 Experimental setup

As explained earlier, in each scenario a different discretization method is used to obtain the rounding set, and the original values are then encoded using it. When applying EFD, some variables cannot be partitioned exactly into the specified number of intervals c. This is due to the fact that unique partitions cannot be created for the variable for that particular c value. In such cases, the next highest number of intervals is selected for discretization. The same approach was used when partitioning data into x quantiles in RBD.

Usually, SDC/masking methods are applied only to the quasi-identifiers (the variables that act as indirect identifiers and can be used for re-identification or record linkage), and the sensitive (dependent) variables are left in their original form. By masking such variables their uniqueness is concealed, thus limiting the risk of disclosure. However, in these test scenarios we have considered all the continuous variables as quasi-identifiers and masked them using microaggregation and discretization methods. For the datasets used to train decision trees, the sensitive variable (class variable) is left as it is, since it is categorical in nature.

In the case of the linear regression datasets, the sensitive (dependent) variable is also numerical. From the preliminary test results, it was noted that when all the continuous variables of a given dataset are masked (fully discretized), including the sensitive (dependent) variable, the trained model's utility is better (lower MSE value) than when the sensitive (dependent) variable is left unmasked. For example, on the Boston housing prices dataset, when the LR models are built on a fully discretized dataset, the utility of the discretized models is equal to or better than that of the original model in 26 out of 55 cases. In contrast, when the sensitive (dependent) variable is not masked, the original model always reports a better utility value. This results from the strengthened correlation among the variables when the same treatment is applied to the predicted variable. Therefore, in the LR test cases the datasets are fully masked.

5.3 Results

Table 7 illustrates the results of linear regression on rounded training data with varying anonymity constraint values (k or c). When the results are compared based on the MSE values, it can be seen that MDAV reports the highest number of instances where the MSE values are lower than or equal to those of the original (baseline) model. For a k parameter value as high as 25, all the MA based methods report MSE values lower than the original model, indicating that the discretized models not only provide a privacy guarantee but also result in high predictive accuracy. Here, the MSE values (prediction accuracy) are obtained from the models built on rounded/discretized data, which are then used to obtain predictions on the original data. The lowest MSE values of 14.57 and 17.88 are reported by EFD when the parameter c is set to 3 and 10, respectively. Both EWD and RBD provide a higher number of instances where the predictive accuracies are better than in the original case. Moreover, 3 out of 4 cases of using the Freedman–Diaconis rule with the unsupervised discretization methods also result in better predictive accuracies compared to the baseline model. However, none of the KMD instances was able to achieve the baseline predictive accuracy or better. As shown by the results, the IL values (shown by ILMetrics) are not directly related to the predictive accuracy of the models. On average, MA based methods work better than the unsupervised discretization methods.

The predictive accuracy of the ML models is not very susceptible to the information loss caused by discretization. The discretized data deviate from their original form, but the statistical properties required to model the data are preserved, so they can be used to train useful ML models. In the case of LR, we have obtained the mean absolute correlation that summarizes the correlation matrix of a given data file. As indicated by the results, most combinations of anonymity constraint value and masking method maintain their correlation values within \(\pm 0.2\) of the original value. Generally, a higher correlation results in a lower MSE and vice versa.

Table 8 illustrates the results of applying rounding to decision tree classifiers. Two accuracy measures are taken for evaluation purposes. \(Acc_V\) indicates the classification accuracy of a given ML model on its test data, and \(Acc_O\) indicates the classification accuracy of a given ML model with respect to the original data. In this case, models are built on the rounded/discretized data and evaluated on the original data. This criterion is used as an evaluation measure to understand the utility of the models built on masked data.

The average \(Acc_O\) values reported by each method are as follows: OMA 0.9289, MDAV 0.93, VMDAV 0.936, EWD 0.95, EFD 0.946, KMD 0.918 and RBD 0.924. As shown by the results, the EWD, EFD and VMDAV methods perform better than the rest. Showing the same pattern as before, KMD performs poorly. On the discretized/rounded dataset, we derived the entropy values based on Shannon's entropy for each variable and summed them up to obtain the total entropy. As shown by the results, when data are discretized the entropy decreases. Moreover, entropy is inversely related to information loss. As the anonymity constraint value (k) increases in MA methods, the entropy drops gradually, and the opposite behaviour can be seen for the unsupervised discretization methods. However, low entropy does not impact the validation accuracy (\(Acc_V\)) of the decision tree models. The correlation between \(Acc_V\) and entropy is 0.14, which indicates a negligible positive relationship. When the correlation is measured between \(Acc_O\) and entropy, it shows a moderately inverse relationship of \(-0.54\), where the accuracy increases as the entropy decreases. This behaviour can be explained as follows. Low entropy levels indicate a smaller amount of information, or less uncertainty, in the data. This can also be attributed to lower diversity in the underlying data. When continuous data are less diverse due to discretization (as many unique data points are replaced with centroids), the ML models derived from such data are more generalized compared to the models built on the original data, as they are not over-fitted. Therefore, the DT models built on discretized data still show a fairly good accuracy when they are used to predict previously unseen data, despite the IL incurred in the process.
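
The entropy values referred to above can be computed as in the short R sketch below, which treats each discretized variable as a categorical distribution over its rounding points.

    # Sketch: Shannon entropy of a (discretized) variable from the relative
    # frequencies of its rounding points, summed over all variables.
    shannon_entropy <- function(v) {
      p <- table(v) / length(v)
      -sum(p * log2(p))
    }
    total_entropy <- function(df) sum(vapply(df, shannon_entropy, numeric(1)))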

In these experiments, we have also employed two supervised discretization methods that use class information in order to discretize continuous data, namely discretization using the minimum description length principle (MDLP) and discretization using the ChiMerge (ChiM) algorithm. MDLP is an entropy based method, whereas ChiM uses the \(\chi ^2\) statistic to determine the discretization points. Many empirical studies in the literature have shown that supervised discretization methods are more effective than unsupervised discretization methods in terms of maintaining a high predictive accuracy when training and validation accuracies are determined. In this case, we are also interested in measuring \(Acc_O\), which indicates the utility of a discretized model against the original data when supervised discretization methods are used. Also, we want to explore whether supervised discretization methods can be used as an alternative for generating the rounding set. When discretized using supervised methods, the validation accuracy (\(Acc_V\)) is equal to or greater than that of the original model. However, when the original data are classified using the discretized ML models (\(Acc_O\)), it can be seen that the accuracies are far lower than for any other method. Low entropy values are also noted in this case. However, the reduction of entropy has to be done carefully. Otherwise, as shown by the outcomes of the supervised discretization methods, a significant reduction of entropy can lose useful information, so that the minimum amount of information required to learn the model is no longer available. Thus, the models built on such data inherit a poor predictive accuracy when tested with new data. This indicates that supervised discretization methods are not ideal for rounding.

When the overall results are considered, it can be seen that the Freedman–Diaconis rule provides an accuracy very close to the original case, regardless of the dataset or the discretization method adopted. The interval based DR on the Boston housing dataset when the Freedman–Diaconis rule is applied varies from 0.40 to 0.83 (OMA 0.5889, MDAV 0.5945, VMDAV 0.5561, EWD 0.8346, EFD 0.7929, KMD 0.4996 and RBD 0.40612), which provides a good privacy-utility trade-off compared to the other methods. Therefore, this rule can be considered when selecting anonymity constraint values, as it provides a good accuracy while minimizing the disclosure risk, especially in cases where the data owners do not have a clear insight into what value should be selected as the anonymity constraint (degree of privacy).

We have also performed experiments with the iris and faithful datasets, but for the sake of conciseness the discussion is restricted to the datasets mentioned above; the results for iris and faithful were similar.

Table 7 Linear regression on Boston housing prices dataset
Table 8 Decision tree models trained on Wine classification dataset

6 Discussion

In this work, we have explored different discretization methods in order to mask numerical data. Based on the results, it can be concluded that microaggregation (MA) based methods incur a low IL compared to unsupervised discretization methods and, intuitively, result in an increased DR. A single discretization technique or anonymity constraint value cannot be singled out as the best at addressing the privacy-utility trade-off, as the nature of the underlying data affects the outcome. However, methods like OMA, VMDAV and EFD perform well in most instances. For normally and exponentially distributed data with a small number of intervals (c), RBD (re-sampling based discretization) is preferable to the other methods as it incurs a low privacy-utility trade-off. For a uniformly distributed dataset, EWD or EFD can be used to obtain the rounding set with a minimal privacy-utility trade-off. On average, uniformly distributed data can be discretized with minimal IL compared to the other data distributions we have examined here. Normally distributed data incur the highest IL, whereas exponentially distributed data report the highest DR, regardless of the rounding method used. Therefore, examining the data distribution beforehand can be helpful in deciding on the rounding method and privacy parameters such as the interval/cluster size, aggregation method, etc. Instead of using a fixed anonymity constraint value for all the variables, we can define the size of k or c per variable in the case of univariate discretization. We have illustrated some example cases in the experiments using the Freedman–Diaconis rule to determine the anonymity constraint without requiring the users to specify it. The experiments show that, in both unsupervised discretization and MA, the outcome of using this approach is very close to the baseline results.

As discussed earlier, releasing rounded/discretized data helps to mitigate the disclosure risk. However, this results in an IL which directly impacts the analytical value of the underlying dataset. In this work, we explored how the IL caused by rounding can influence the predictive power of machine learning models. It seems that IL does not necessarily result in poor predictive accuracy of ML models. In many cases, rounding improves the model utility as it reduces the noise in the data, so that the ML algorithm can learn without the risk of over-fitting.

For example, consider data owners who release a perturbed version of the original data to data analysts in order to minimize the risk of disclosure. Assume the data masking method used is carefully tuned so that the analytical value of the data is not completely destroyed. In this case, the models trained on such data should also have a good predictive accuracy, perhaps with a slight reduction compared to the original model. When we build an ML model, one of the main concerns is to avoid over-fitting. If the generated models are more generalized with respect to the training data, a better accuracy can be seen when new data are classified using these models. In ML, this is mainly achieved through regularization. In our case, generalized data are used to train the models with the expectation that this would result in simple but accurate models. The other advantage is that these data masking techniques also guarantee a degree of privacy for the data in use. This privacy-utility trade-off in model building can be justified if we are dealing with sensitive information. Considering the above mentioned facts, it can be concluded that models built on rounded data are generalized and thus secure a good predictive accuracy.

7 Conclusion

“Rounding” is a numerical data masking technique which has not previously been discussed in the literature with empirical results. The operating principle of rounding can be seen as discretization or quantization, where continuous values are mapped into a discrete space. In this work, we discuss rounding in a unified way with unsupervised discretization and microaggregation, where these methods are used to generate the rounding sets. We have also introduced a re-sampling based discretization method for continuous data, which works better with a small number of intervals, thus minimizing the disclosure risk. These methods are evaluated based on their information loss and disclosure risk with respect to theoretical distributions and real world data. Finally, the rounded data are used to train linear regression and decision tree models, and the impact of rounding on model accuracy is discussed. Based on the results, it can be concluded that, generally, microaggregation based methods are more suitable for deriving the rounding set. However, depending on the data distribution, in some cases unsupervised discretization methods outperform microaggregation methods. Also, we have used the Freedman–Diaconis rule to define the anonymity constraint value per attribute and shown that this method can be used to minimize the disclosure risk while maintaining a model utility close to that of the benchmark (original) model.

This work is focused on univariate, deterministic rounding. In future work, it will be interesting to explore multivariate and stochastic rounding based on the methods discussed above. A study on the different aggregation methods that can be used to obtain the centroids/rounding points would also be interesting in terms of managing the IL and DR.