Introduction

“Information is the oil of the 21st century, and analytics is the combustion engine.”—Peter Sondergaard, Senior Vice President and Global Head of Research at Gartner, Inc.

As the above quote highlights, data analysis and prediction have become the cornerstone of corporate and public policy. While powerful insights can be obtained when granular data—often about individuals—are shared for research, concerns about the privacy of such granular data limit society’s potential to put it to optimal use. Individuals’ privacy can be compromised even when their data is shared with their approval and is stripped of personal identifiers. Na et al. [1] show how researchers could re-identify 95 percent of individual adults in the National Health and Nutrition Examination Survey using machine learning techniques; the authors of [2] similarly report a high re-identification rate. A prominent example is that of William Weld, the former Governor of Massachusetts, who was re-identified using a linkage attack [3].

Differential privacy has emerged as a technique to ensure the privacy of individuals in a dataset, even when their data is shared publicly [4]. Because differential privacy changes the process for accessing data, rather than the database itself, it preserves individuals’ privacy even when the data is subjected to various privacy attacks. The use of differential privacy in the 2020 U.S. Census signals a seminal change in government statistics [5]. Leading corporations and governments have started applying differential privacy to their datasets; see [5,6,7].

In the last few years, there has been an explosion of research articles that apply differential privacy to various functional areas such as healthcare [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], learning [23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38], location-based services [39,40,41,42,43,44,45,46,47], internet-based collaboration [48], Internet of Things [49,50,51], blockchains [52,53,54], cyber-physical systems [55,56,57,58], neural networks [59], social media and social network analysis [60,61,62], crowd-sourcing [63,64,65], and mobile edge computing environments [66,67]. Pejó and Desfontaines [68] study the numerous variants and extensions that adapt differential privacy to different scenarios and attacker models.

Existing research in this area emphasizes an intrinsic trade-off between the privacy of a dataset and its utility for analytics. In their survey of the privacy literature, the authors of [69] describe this trade-off thus: “differential privacy provides either very little privacy or very little utility or neither.” In contrast to such existing literature, this paper shows that differential privacy can be employed to precisely—not approximately—retrieve the associations in the original dataset. As viable methods of privacy protection that do not impinge on the quality of data analytics are vital to our increasingly data-reliant and privacy-conscious society, our study makes an important contribution by showing that differential privacy can ensure privacy while preserving the quality of data analytics as in the original data.

We examine, conceptually and empirically, the impact of noise addition using differential privacy on the quality of data analytics on a modified dataset, i.e., a dataset with added noise. As associations between the dependent and independent variables are typically captured using the slope parameter in a regression, we examine the impact of noise addition on the slope parameter. We obtain two key results. First, the accuracy of analytics following noise addition increases with the privacy budget and with the variance of the independent variable. Second, the accuracy of analytics following noise addition increases disproportionately with an increase in the privacy budget when the variance of the independent variable is greater. To test these two predictions, we use actual data in which both the dependent and explanatory variables are private. We add Laplace noise to both variables and then compare the slopes in the original and modified datasets, finding evidence that supports both predictions. We thus establish, conceptually and empirically, that the utility-privacy trade-off exists in differential privacy.

We then ask the central question in this study: Can this utility-privacy trade-off be overcome using differential privacy? We highlight that differential privacy can ensure precise data analytics even while preserving the privacy of the individuals in a dataset, provided the noise added satisfies the following criteria. If the dependent variable and an explanatory variable are both private, three conditions must hold: first, the noise added to the dependent variable is independent of the explanatory variable; second, the noise added to the explanatory variable is independent of the dependent variable; and third, these two noise terms are independent of each other. Given these criteria, we show that once the privacy budget employed to construct a differentially private dataset is declared, the original slope parameter can be precisely retrieved using the variance of the independent variable and the slope parameter estimated from the modified dataset. Critically, we demonstrate these results while remaining agnostic about the statistical distribution from which the noise is drawn to achieve differential privacy. As revealing the privacy budget used to arrive at the differentially private dataset does not necessarily compromise the privacy of the dataset, differential privacy can enable us to overcome the utility-privacy trade-off.

If only the dependent variable is private while the explanatory variable is public, noise needs to be added only to the dependent variable. In this case, if the noise added to the dependent variable is independent of the explanatory variable, the original slope parameter is identical to the estimate generated from the modified dataset; this result is again agnostic to the statistical distribution from which the noise is drawn to achieve differential privacy.

Our study makes an important contribution to the differential privacy literature. In their survey of the privacy literature, the authors of [69] classify differential privacy, k-anonymity, l-diversity, and t-closeness as techniques that employ input privacy for data mining. Outlining the advantages of differential privacy through the contributions of [10, 70], they highlight that “differential privacy is becoming a popular research area as it guarantees data privacy... (and) ensures utility as noise addition is minimal thus providing a close approximation of original results.” (emphasis added) However, outlining its disadvantages, they write that “differential privacy provides either very little privacy or very little utility or neither.” A similar belief is expressed in [71]: “It is believed that certain paradigms such as differential privacy reduce the information content too much to be useful in practical situations.” (p. 322) In contrast, our study shows that by declaring the privacy budget used in generating a differentially private dataset, precise—not approximate, as claimed in [69]—data analytics can be performed using the modified dataset even while preserving its privacy.

Within the scope of the utility-privacy trade-off, our study contrasts with:

  1. The claim in [69] that “differential privacy provides either very little privacy or very little utility or neither.” Our study shows that both privacy and utility can be obtained using differential privacy.

  2. The thesis in [72] that techniques for privacy preservation have “a noticeable impact of privacy-preservation techniques in predictive performance.” Our study shows that differential privacy can ensure no such impact on predictive performance.

  3. The concern raised in [5] with respect to the use of differential privacy by the 2020 U.S. Census that the “transition to differential privacy has raised a number of questions about the proper balance between privacy and accuracy in official statistics.” Our study shows that these concerns about the balance between privacy and accuracy—with respect to analytics using the census data—may be misplaced.

Our study also contributes to the literature on privacy-preserving data analytics. Zhang et al. [73] survey the literature on privacy-preserving association rule mining, focusing especially on current methodologies. Ahluwalia et al. [74] study association rule mining in which the mining is conducted by a third party over data that is located at a central location and updated from several source locations. We show that differential privacy can be used to completely preserve the utility of data analytics while ensuring the privacy of the data.

The paper is structured as follows. The next section analyzes the effects of noise addition on the accuracy of analytics. The subsequent section presents the key result of our study. The final section concludes the paper.

Effect of noise addition using differential privacy on data analytics

Following [4, 75], \(\epsilon\)-differential privacy is defined formally as follows. If \(\epsilon\) is a positive real constant, A is a randomized process, D and D\(^{\prime }\) are databases that differ by the data of one individual, O is some output of the process A, and P denotes probability, then \(\epsilon\)-differential privacy requires:

$$\begin{aligned} P[A(D)=O]\le e^\epsilon \cdot P[A(D^{\prime })=O] \end{aligned}$$
(1)

The smaller \(\epsilon\) is, the closer the probabilities above are, and, therefore, the more differentially private the process is. Conversely, a higher \(\epsilon\) implies a less differentially private process.
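To make this definition concrete, the canonical approach is the Laplace mechanism, which perturbs a query answer with Laplace noise scaled to the query’s sensitivity divided by \(\epsilon\). Below is a minimal Python sketch; the function name and example values are ours, purely for illustration:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value under epsilon-differential privacy.

    Adding Laplace noise with scale sensitivity/epsilon guarantees that,
    for databases D and D' differing in one individual, the probability
    of any output changes by at most a factor of e**epsilon, as in Eq. (1).
    """
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise -> more privacy
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1: one individual changes the count by at most 1.
count = 1234
print(laplace_mechanism(count, sensitivity=1.0, epsilon=0.25))  # very private, very noisy
print(laplace_mechanism(count, sensitivity=1.0, epsilon=7.0))   # less private, close to 1234
```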

Having defined \(\epsilon\)-differential privacy, we now study the central thesis of this paper: the purported trade-off between the utility and privacy of a differentially private dataset. As the associations between dependent and independent variables—in univariate or multivariate settings—are central to data analytics, we study the effect of adding noise to enable differential privacy on these associations, as measured by the slope parameter in a regression.

We first consider the case where both the dependent variable y and the independent variable x are private.

Adding noise to private dependent and independent variables: conceptual analysis

Denote by \(\Lambda (\mu ,\sigma )\) a function that returns a random draw from a distribution with mean \(\mu\) and standard deviation \(\sigma\). We use \(\Lambda ^{\prime }(\mu ,\sigma )\) to denote an independent random draw from the same distribution. We add noise from a distribution with \(\sigma =\frac{\alpha }{\epsilon }\), where \(\alpha\) is a constant and \(\epsilon\) is the privacy parameter, to both the dependent and independent variables in an ordinary least squares (OLS) regression.Footnote 1 We get the following equation:

$$\begin{aligned} \left[ y_i+\Lambda ^{\prime }\left( \mu ,\frac{\alpha }{\epsilon }\right) \right] =\beta _0+\beta _1\cdot \left[ x_i+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right] +\nu _i, \end{aligned}$$
(2)

where the estimate of \(\beta _1\) is given by:

$$\begin{aligned} \beta _1=\frac{covar(x,y)}{var(x)}, \end{aligned}$$
(3)

where covar(x, y) denotes the covariance between random variables x and y and var(x) denotes the variance of random variable x. Similarly, after noise addition, the new slope parameter \(\beta _1^{\prime }\) equals:

$$\begin{aligned} \beta ^{\prime }_1=\frac{covar\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) ,y +\Lambda ^{\prime }\left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} }{var\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} } \end{aligned}$$
(4)

The denominator equals

$$\begin{aligned} var\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} =var(x)+var\left\{ \Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} +2 \cdot covar\left\{ x,\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} \end{aligned}$$
(5)

Now, as \(\Lambda (\mu ,\frac{\alpha }{\epsilon })\) is independent of x, the covariance between them equals 0, while the variance of \(\Lambda (\mu ,\frac{\alpha }{\epsilon })\) equals \(\frac{\alpha ^2}{\epsilon ^2}\). Therefore, the denominator simplifies to:

$$\begin{aligned} var\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} =var(x)+\frac{\alpha ^2}{\epsilon ^2} \end{aligned}$$
(6)

The numerator equals

$$\begin{aligned}&covar\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) , y+\Lambda ^{\prime }\left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} =covar(x,y)+covar\left\{ x,\Lambda ^{\prime } \left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} \nonumber \\&\quad + covar\left\{ \Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) , y\right\} +covar\left\{ \Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) , \Lambda ^{\prime }\left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} \end{aligned}$$
(7)

We use the fact that \(\Lambda (\mu ,\frac{\alpha }{\epsilon })\) and \(\Lambda ^{\prime }(\mu ,\frac{\alpha }{\epsilon })\) are independent of each other, y is independent of \(\Lambda (\mu ,\frac{\alpha }{\epsilon })\), and x is independent of \(\Lambda ^{\prime }(\mu ,\frac{\alpha }{\epsilon })\). Hence, all covariance terms other than the first are zero, and the numerator simplifies to:

$$\begin{aligned} covar\left\{ x+\Lambda \left( \mu ,\frac{\alpha }{\epsilon }\right) ,y +\Lambda ^{\prime }\left( \mu ,\frac{\alpha }{\epsilon }\right) \right\} =covar(x,y) \end{aligned}$$

Using the simplified numerator and denominator, we get

$$\begin{aligned} \beta ^{\prime }_1=\frac{covar(x,y)}{var(x)+\frac{\alpha ^2}{\epsilon ^2}} \end{aligned}$$
(8)

Using Eqs. (3) and (8) and denoting var(x) as \(\sigma _x^2\):

$$\begin{aligned} \frac{\beta _1}{\beta _1^{\prime }}=1 +\frac{\alpha ^2}{\epsilon ^2 \sigma _x^2} \end{aligned}$$
(9)

Clearly, \(|\beta ^{\prime }_1|<|\beta _1|\), which leads to the following result:

Result 1

If the dependent variable y and an explanatory variable x are both private variables, then the slope parameter used in data analytics is lower in magnitude after noise addition when compared to the slope parameter in the original dataset.
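This attenuation can be reproduced numerically. The following minimal Python sketch uses illustrative parameter values of our own choosing; the Laplace scale is divided by \(\sqrt{2}\) so that the noise standard deviation equals \(\alpha /\epsilon\), as assumed in Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, epsilon, sigma_x = 100_000, 1.0, 1.0, 2.0

x = rng.normal(0.0, sigma_x, n)
y = 3.0 * x + rng.normal(0.0, 1.0, n)            # true slope beta_1 = 3

# Laplace(0, b) has sd b*sqrt(2); divide by sqrt(2) so the noise sd is alpha/epsilon
b = (alpha / epsilon) / np.sqrt(2)
x_noisy = x + rng.laplace(0.0, b, n)             # noise independent of y
y_noisy = y + rng.laplace(0.0, b, n)             # noise independent of x and of the above

beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_1_prime = np.cov(x_noisy, y_noisy, ddof=1)[0, 1] / np.var(x_noisy, ddof=1)

print(beta_1 / beta_1_prime)   # ~ 1 + alpha^2/(epsilon^2 * sigma_x^2) = 1.25, per Eq. (9)
```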

Figure 1 shows how the ratio \(\frac{\beta ^{\prime }_1}{\beta _1}\) varies with the variance of x over different values for \(\epsilon\).

Fig. 1: Change in accuracy of analytics following noise addition

In Fig. 1 and in (9), we notice that, for a given value of \(\epsilon\), as the variance in x increases, \(\beta ^{\prime }_1\) approaches \(\beta _1\). Similarly, for a given value of \(\sigma _x^2\), \(\beta ^{\prime }_1\) approaches \(\beta _1\) as \(\epsilon\) increases. Thus, we get the following two results:

Result 2A

If the dependent variable y and an explanatory variable x are both private variables, then the accuracy of analytics following noise addition increases with increases in the privacy budget (\(\epsilon\)) and in the variance of the independent variable (\(\sigma _x^2\)).

$$\begin{aligned} \frac{\partial (\beta _1^{\prime }/\beta _1)}{\partial \epsilon }>0,\quad \frac{\partial (\beta _1^{\prime }/\beta _1)}{\partial (\sigma _x^2)}>0. \end{aligned}$$
(10)
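Explicitly, rewriting Eq. (9) as \(\beta _1^{\prime }/\beta _1=\epsilon ^2\sigma _x^2/(\epsilon ^2\sigma _x^2+\alpha ^2)\), the signs in Eq. (10) follow directly:

$$\begin{aligned} \frac{\partial (\beta _1^{\prime }/\beta _1)}{\partial \epsilon }=\frac{2\epsilon \sigma _x^2\alpha ^2}{(\epsilon ^2\sigma _x^2+\alpha ^2)^2}>0,\quad \frac{\partial (\beta _1^{\prime }/\beta _1)}{\partial (\sigma _x^2)}=\frac{\epsilon ^2\alpha ^2}{(\epsilon ^2\sigma _x^2+\alpha ^2)^2}>0. \end{aligned}$$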

Result 3A

If the dependent variable y and an explanatory variable x are both private variables, then the accuracy of analytics following noise addition increases disproportionately with an increase in the privacy budget (\(\epsilon\)) when the variance of the independent variable (\(\sigma _x^2\)) increases:

$$\begin{aligned} \frac{\partial ^2(\beta _1^{\prime }/\beta _1)}{\partial \epsilon \,\partial (\sigma _x^2)}>0. \end{aligned}$$
(11)

Adding noise to private dependent and independent variables: empirical evidence

In this section, we analyze how adding noise to private dependent and independent variables to satisfy differential privacy affects the accuracy of data analytics. We focus on two popular techniques for data analysis: ordinary least squares (OLS) regression and difference-in-difference estimation using panel data techniques ([76], Ch. 5, [77]). The estimate of the slope parameter in a difference-in-difference analysis resembles the discrete equivalent of a second-order derivative of the dependent variable with respect to the independent variable. Therefore, the variance of the independent variable in a difference-in-difference analysis is significantly greater than that in an OLS analysis. Thus, applying OLS and difference-in-difference estimation simultaneously to the same dataset enables us to proxy changes in the variance of the independent variable. In contrast, using two different datasets for this purpose would introduce other differences that could confound the empirical analysis. In our empirical analysis, therefore, the OLS estimates proxy the dataset with the lower variance for the independent variable, while the difference-in-difference estimates proxy the dataset with the higher variance.

We add noise from a Laplace distribution to a real dataset containing data on vaccination and health outcomes across all the states in the United States. The data on health outcomes—which includes the total number of cases per million people, the total number of deaths per million people, and the percentage case fatality rate—is collected from Oxford University’s COVID-19 Government Response Tracker. The data on vaccination is collected from ourworldindata.org. The time period of the data is from Jan-2021, when vaccination first began in the U.S., to Apr-2021, when we collected the data. We add noise to this data using nine different values of epsilon: 0.25, 0.5, 1, 2, 3, 4, 5, 6, and 7.

In both analyses, OLS and difference-in-difference, we compare the slope parameter obtained on the noisy dataset (\(\beta _1^{\prime }\)) with that obtained on the original dataset (\(\beta _1\)). We make this comparison for each value of epsilon to identify the most accurate epsilon, i.e., the one with the smallest difference in the slope parameters vis-à-vis the original. We repeat this procedure 100 times and aggregate the results. For each epsilon value, we also compute the average of the squared difference between \(\beta _1^{\prime }\) and \(\beta _1\) over the 100 repetitions to gauge the effects of noise addition.
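The following Python sketch outlines this procedure. It substitutes synthetic data for the vaccination dataset, which we do not reproduce here; all variable names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
epsilons = [0.25, 0.5, 1, 2, 3, 4, 5, 6, 7]
n_reps = 100

def ols_slope(x, y):
    """Slope of an OLS regression of y on x."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Synthetic stand-in for the private vaccination/health-outcome variables
x = rng.normal(50.0, 10.0, 500)
y = 2.0 * x + rng.normal(0.0, 5.0, 500)
beta_1 = ols_slope(x, y)                   # slope on the original data

wins = {eps: 0 for eps in epsilons}        # times each epsilon was most accurate
avg_sq_diff = {eps: 0.0 for eps in epsilons}

for _ in range(n_reps):
    sq_diff = {}
    for eps in epsilons:
        scale = 1.0 / eps                  # Laplace scale alpha/epsilon with alpha = 1
        x_n = x + rng.laplace(0.0, scale, x.size)
        y_n = y + rng.laplace(0.0, scale, y.size)
        sq_diff[eps] = (ols_slope(x_n, y_n) - beta_1) ** 2
        avg_sq_diff[eps] += sq_diff[eps] / n_reps
    wins[min(sq_diff, key=sq_diff.get)] += 1

print(wins)         # larger epsilons win most often (cf. Figs. 2 and 3)
print(avg_sq_diff)  # average squared difference falls as epsilon rises (cf. Table 1)
```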

Fig. 2: Most accurate \(\epsilon\) for OLS regression

Fig. 3: Most accurate \(\epsilon\) for difference-in-difference estimates

The results of our analysis are shown in Figs. 2 and 3, which display bar charts of the number of times each epsilon value was most accurate for OLS and difference-in-difference, respectively. These figures confirm our theoretical prediction in Result 2A that an increase in the value of \(\epsilon\) increases the accuracy of data analytics. Thus, we state the following result from our empirical analysis:

Result 2B

As theoretically predicted in Result 2A, the empirical analysis using actual data confirms that the accuracy of analytics following noise addition increases with the value of \(\epsilon\).

Figures 4 and 5 display bar charts of the average squared error in the slope parameters for OLS and difference-in-difference, respectively, for each value of epsilon. We observe that the average difference over 100 iterations between the result obtained using the noisy dataset and that obtained using the original dataset decreases monotonically as epsilon increases. This result is consistent with our findings from Figs. 2 and 3, which displayed the most accurate epsilon.

Fig. 4: Accuracy of analytics using OLS regression for varying \(\epsilon\)

Fig. 5: Accuracy of analytics using difference-in-difference estimates for varying \(\epsilon\)

Table 1 Average square of differences in slope parameter in original dataset and in dataset with noise addition using ordinary least squares (OLS) and difference-in-difference (DiD)

Table 1 displays the exact values of the average squared differences between the slope parameters. We observe that as the value of epsilon rises, the average squared difference falls monotonically—by a factor of approximately 6 for ordinary least squares and approximately 200 for difference-in-difference. There is thus a disproportionately larger drop in the average differences for the difference-in-difference estimates than for the ordinary least squares estimates, which is consistent with the theoretical analysis. Result 3A predicts that as the variance of the independent variable increases, an increase in \(\epsilon\) disproportionately increases the accuracy. The empirical evidence comparing the change in accuracy with \(\epsilon\) for the difference-in-difference analysis versus the ordinary least squares analysis is therefore consistent with Result 3A:

Result 3B

As theoretically predicted in Result 3A, the effect of an increase in the value of \(\epsilon\) is disproportionately greater in a difference-in-difference analysis than in an ordinary least squares regression.

Thus, the empirical analysis clearly confirms that as the privacy budget (\(\epsilon\)) decreases, the utility of the data analytics—as measured by the slope parameter capturing the association between the dependent and independent variables—declines. On the other hand, the definition of differential privacy in Eq. (1) shows clearly that as the privacy budget (\(\epsilon\)) decreases, the data becomes more private. Our conceptual and empirical analysis thus demonstrates the intrinsic trade-off between privacy and utility when employing differential privacy. This trade-off has been described in [69], whose authors outline the disadvantages of differential privacy by noting that “differential privacy provides either very little privacy or very little utility or neither.” Similarly, the authors of [72] highlight that techniques for privacy preservation have “a noticeable impact of privacy-preservation techniques in predictive performance.”

Adding noise to only private dependent variable: conceptual analysis

Having analyzed the case where both the dependent and independent variables are private, we now examine the impact on data analytics when the independent variable is public, so that no noise needs to be added to it to preserve privacy. As the case where the dependent variable is public is not interesting from the perspective of data analysis, we ignore it; we note, however, from the formula for the slope parameter in Eq. (3) that when noise is added only to the dependent variable, the slope parameter remains unchanged, as we now show. In this case, we get the following equation:

$$\begin{aligned} \left[ y_i+\Lambda ^{\prime }\left( 0,\frac{\alpha }{\epsilon }\right) \right] =\beta _0+\beta _1\cdot x_i+\nu _i, \end{aligned}$$
(12)

After noise addition, the new slope parameter \(\beta _1^{\prime }\) equals:

$$\begin{aligned} \beta ^{\prime }_1=\frac{covar\left( x,y+\Lambda ^{\prime } \left( 0,\frac{\alpha }{\epsilon }\right) \right) }{var(x)} \end{aligned}$$
(13)

Replicating the steps shown in the previous section, we find that

$$\begin{aligned} covar\left\{ x,y+\Lambda ^{\prime } \left( 0,\frac{\alpha }{\epsilon }\right) \right\} =covar(x,y) \end{aligned}$$

Therefore, in the case where the independent variable is a public variable,

$$\begin{aligned} \beta ^{\prime }_1=\beta _1 \end{aligned}$$
(14)

This leads to our next result:

Result 4

When the independent variable is a public variable, the slope parameter remains unchanged after noise addition.
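Result 4 is straightforward to verify numerically. A minimal sketch, with illustrative values of our own choosing, adds noise only to the dependent variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, epsilon = 100_000, 1.0, 0.5

x = rng.normal(0.0, 2.0, n)                          # public independent variable
y = 3.0 * x + rng.normal(0.0, 1.0, n)                # private dependent variable

def slope(a, b):
    """OLS slope of b on a, as in Eq. (3)."""
    return np.cov(a, b, ddof=1)[0, 1] / np.var(a, ddof=1)

y_noisy = y + rng.laplace(0.0, alpha / epsilon, n)   # noise independent of x
print(slope(x, y), slope(x, y_noisy))                # both approximately 3
```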

Key advantage of differential privacy for data analytics: precise analytics without losing privacy

Having conceptually and empirically demonstrated the trade-off between the utility of a differentially private dataset for data analytics and its ability to preserve privacy, we ask the central question of this study: Can this trade-off be avoided? As the slope parameter remains unchanged in the case where the independent variable is a public variable, we focus only on the case where both the dependent and independent variables are private. In this case, we highlight that differential privacy can ensure the precision of data analytics even while preserving the privacy of the individuals in a dataset.

Combining Eqs. (6) and (9), the slope parameter in the original dataset can be regenerated from the slope parameter in the modified dataset using the variance of the independent variable in the modified dataset \(\sigma _{x^{\prime }}^2\), the privacy budget \(\epsilon\), and the constant \(\alpha\), as follows:

$$\begin{aligned} \beta _1=\beta _1^{\prime } \left\{ \frac{\epsilon ^2 \sigma _{x^{\prime }}^2}{\epsilon ^2 \sigma _{x^{\prime }}^2-\alpha ^2}\right\} \end{aligned}$$
(15)

Thus, given the level of differential privacy employed in the modified dataset (i.e., with noise addition), the original slope parameter can be accurately retrieved using the variance of the independent variable in the modified dataset \(\sigma _{x^{\prime }}^2\) and the slope parameter estimated from the modified dataset \(\beta _1^{\prime }\), provided the following criteria are satisfied. If the dependent variable and an explanatory variable are both private, three conditions must hold: first, the noise added to the dependent variable is independent of the explanatory variable; second, the noise added to the explanatory variable is independent of the dependent variable; and third, these two noise terms are independent of each other. If only the dependent variable is private while the explanatory variable is public, only one condition must hold: the noise added to the dependent variable is independent of the explanatory variable.
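The following minimal Python sketch illustrates this retrieval. It assumes the curator declares \(\epsilon\) and the constant \(\alpha\), and that the noise standard deviation equals \(\alpha /\epsilon\); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha, epsilon = 100_000, 1.0, 1.0

x = rng.normal(0.0, 2.0, n)
y = 3.0 * x + rng.normal(0.0, 1.0, n)                # true slope beta_1 = 3

# Differentially private release: independent noise with sd alpha/epsilon on x and y
b = (alpha / epsilon) / np.sqrt(2)                   # Laplace scale giving sd alpha/epsilon
x_prime = x + rng.laplace(0.0, b, n)
y_prime = y + rng.laplace(0.0, b, n)

# Everything below uses ONLY the modified dataset plus the declared epsilon and alpha
var_x_prime = np.var(x_prime, ddof=1)
beta_1_prime = np.cov(x_prime, y_prime, ddof=1)[0, 1] / var_x_prime

# Eq. (15): retrieve the original slope from the modified dataset
beta_1_retrieved = beta_1_prime * (epsilon**2 * var_x_prime) / (epsilon**2 * var_x_prime - alpha**2)
print(beta_1_retrieved)                              # approximately 3
```

In finite samples the retrieved slope is an estimate; it converges to the original slope as the sample size grows.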

In contrast to the prevailing wisdom on the disadvantages of differential privacy, our study shows that by declaring the privacy budget used in generating a differentially private dataset, the slope parameters in the original dataset can be retrieved precisely. Our paper is thus the first to show that differential privacy provides a precise replication (not an approximation, as claimed in [69]) of the relationships between variables even while preserving the privacy of the dataset. Our study also contrasts with the claim in [69] that “differential privacy provides either very little privacy or very little utility or neither”, the thesis in [72] that techniques for privacy preservation have “a noticeable impact of privacy-preservation techniques in predictive performance”, and the concerns raised in [5] with respect to the use of differential privacy by the 2020 U.S. Census that the “transition to differential privacy has raised a number of questions about the proper balance between privacy and accuracy in official statistics.”

Conclusion and future directions

Advances in computing power have enabled unparalleled opportunities for obtaining insights from granular data, especially data on individuals, to guide corporate and public policy. This trend is accompanied by the increasing importance that society places on individuals’ privacy, creating an intrinsic trade-off between the utility of datasets and the privacy of the individuals who comprise them. Existing literature highlights this trade-off even for one of the newest concepts in privacy—differential privacy. In contrast to such existing literature, our study shows that differential privacy can be employed to precisely—not approximately—retrieve the associations in the original dataset, provided the noise addition satisfies certain criteria.

Given the promise of differential privacy in preserving the privacy of individuals’ data, a follow-up to our study could examine the techniques through which noise can be added to satisfy both differential privacy and the criteria outlined in this study, especially adding noise that is purely random. Another important follow-up would be to analyze whether the results we have demonstrated for ordinary least squares (OLS) regression extend to other analytical techniques, such as those using artificial intelligence and machine learning.