Abstract
Protecting the privacy of datasets has become increasingly important. Many real-life datasets, such as income and medical data, must be secured before they are made public. However, security comes at the cost of losing some useful statistical information about the dataset. Data obfuscation addresses the problem of masking a dataset so that the utility of the data is maximized while the risk of disclosing sensitive information is minimized. Two popular approaches to obfuscating numerical data are (i) data swapping and (ii) the addition of noise to data. While the former masks well but sacrifices all correlation information, the latter yields estimates of most popular statistics, such as the mean, variance, quantiles and correlation, but fails to give an unbiased estimate of the distribution curve of the original data. In this paper, we propose a mixed method of obfuscation combining the above two approaches and discuss how the proposed method succeeds in giving an unbiased estimate of the distribution curve while also giving reliable estimates of other well-known statistics such as moments and correlation.
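As an illustration of the scheme analyzed in the appendix, here is a minimal Python sketch of the mixed masking step, under our reading of the method: each value is swapped with probability p and receives additive Gaussian noise \(N(0,\sigma ^2)\) otherwise. The parameter names follow the appendix; the paper's exact swapping mechanism may differ in detail.

```python
import numpy as np

def conditional_mask(x, p=0.7, sigma=1.0, seed=None):
    """Mask a numeric column: swap each entry with probability p,
    otherwise add N(0, sigma^2) noise. A sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    swap = rng.random(n) < p                  # entries masked by swapping
    z = x + rng.normal(0.0, sigma, size=n)    # entries masked by additive noise
    z[swap] = rng.permutation(x)[swap]        # swaps preserve the marginal law
    return z
```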


Appendix
1.1 Proof of Theorem 2.1
Proof
To find the raw moments of X, denoted by \(\mu _{(X,k)}\), in terms of the moments of Z, we first check that the moments of Z exist whenever those of X do. The kth absolute raw moment of \(Z_j\) satisfies
\(E|Z_j|^k = p\cdot E|X_j|^k + (1-p)\cdot E|X_j+Y_j|^k \le p\cdot E|X_j|^k + (1-p)\cdot 2^{k-1}\bigl (E|X_j|^k + E|Y_j|^k\bigr ) < \infty ,\)
by the \(c_r\)-inequality, since \(E|X_j|^k <\infty \) and \(E|Y_j|^k =\frac{\sigma ^k 2^\frac{k}{2}\Gamma (\frac{k+1}{2})}{\sqrt{\pi }} < \infty \) \(\forall k \in {\mathbb {N}}\).
Thus, the moments of Z exist whenever those of X do. Now, we find estimates of the moments of X in terms of the moments of Z. Since X and Y are independent, the binomial expansion gives
\(E[Z^k] = p\cdot E[X^k] + (1-p)\cdot E[(X+Y)^k] = \mu _{(X,k)} + (1-p)\sum _{r=2,\, r\, \mathrm{even}}^{k}{\binom{k}{r}\, \mu _{(X,k-r)}\, E[Y^r]}.\)
The above equation follows since the odd-order moments of \(Y_i\) are zero.
Now, \(E[Z]=p\cdot E[X] + (1-p) \cdot E[X+Y] =E[X]\). \(\therefore E[\frac{1}{n}\sum _{j=1}^n{Z_j}]=E[Z]=E[X]\).
\(\therefore \frac{1}{n}\sum _{j=1}^n{Z_j}\) is an unbiased estimator of E[X].
Define, in general,
\({\hat{\mu }}_{(X,k)} = \frac{1}{n}\sum _{j=1}^n{Z_j^k} - (1-p)\sum _{r=2,\, r\, \mathrm{even}}^{k}{\binom{k}{r}\, {\hat{\mu }}_{(X,k-r)}\, E[Y^r]}, \qquad {\hat{\mu }}_{(X,0)}=1.\)
If \(E[{\hat{\mu }}_{(X,j)}]=\mu _{(X,j)}\) \(\forall j=1,2, \ldots ,k,\) then
\(E[{\hat{\mu }}_{(X,k+1)}] = E[Z^{k+1}] - (1-p)\sum _{r=2,\, r\, \mathrm{even}}^{k+1}{\binom{k+1}{r}\, \mu _{(X,k+1-r)}\, E[Y^r]} = \mu _{(X,k+1)}.\)
Thus, by induction, \({\hat{\mu }}_{(X,k)}\) is an unbiased estimator for \({\mu }_{(X,k)}\).
Also, note that \({\hat{\mu }}_{(X,k)}=\frac{1}{n}\sum _{j=1}^n{f(Z_j)}\), where f(.) is a polynomial function of finite degree. Moreover, \({\text{Var}}(Z^j)=E(Z^{2j})-E(Z^j)^2 < \infty \) \(\forall j \in {\mathbb {N}}\) whenever the moments of X, and hence of Z, exist and are finite. Thus, \({\text{Var}}({\hat{\mu }}_{(X,k)})=O(\frac{1}{n})\), and \({\hat{\mu }}_{(X,k)}\) is also a consistent estimator of \(\mu _{(X,k)}\). \(\square \)
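To make the recursion concrete, here is a minimal Python sketch of the moment estimator (the function names are ours), using the even-moment formula \(E[Y^r]=\sigma ^r (r-1)!!\) for \(Y \sim N(0,\sigma ^2)\):

```python
import numpy as np
from math import comb

def normal_even_moment(sigma, r):
    """E[Y^r] for Y ~ N(0, sigma^2) and even r >= 2: sigma^r * (r-1)!!."""
    return sigma**r * np.prod(np.arange(r - 1, 0, -2.0))

def moment_estimator(z, k, p, sigma):
    """Unbiased estimate of E[X^k] from the masked sample z,
    following the recursion in the proof of Theorem 2.1."""
    if k == 0:
        return 1.0
    est = np.mean(np.asarray(z, dtype=float) ** k)
    for r in range(2, k + 1, 2):              # odd-order normal moments vanish
        est -= ((1 - p) * comb(k, r) * normal_even_moment(sigma, r)
                * moment_estimator(z, k - r, p, sigma))
    return est
```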
Now, we state and prove the following lemma, which we will require while proving Theorems 2.2 and 2.3 in subsequent sections.
Lemma 6.1
For \((x_1,x_2,x_3) \in {\mathbb {R}}^3\) and \((\sigma _1,\sigma _2) \in \mathbb {R^+}^2\),
\(\int _{-\infty }^{\infty }{\phi _{\sigma _1}(x_1-x_3)\cdot \phi _{\sigma _2}(x_3-x_2)\,{\mathrm{d}}x_3}=\phi _{\sqrt{\sigma _1^2+\sigma _2^2}}(x_1-x_2),\)
where \(\phi _\sigma (x)\) denotes the normal density at \(x \in {\mathbb {R}}\) with mean 0 and standard deviation \(\sigma \).
Proof
To prove the given lemma, write out the product \(\phi _{\sigma _1}(x_1-x_3)\cdot \phi _{\sigma _2}(x_3-x_2)\), complete the square in \(x_3\) in the combined exponent, and integrate \(x_3\) out; the Gaussian integral in \(x_3\) contributes the normalizing constant, and the remaining factor is exactly \(\phi _{\sqrt{\sigma _1^2+\sigma _2^2}}(x_1-x_2)\). Equivalently, the left-hand side is the density of the sum of independent \(N(0,\sigma _1^2)\) and \(N(0,\sigma _2^2)\) random variables, evaluated at \(x_1-x_2\).
\(\square \)
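Lemma 6.1 is the standard Gaussian convolution identity. As a quick numerical sanity check (our own illustration, not part of the paper), the two sides can be compared by quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Check: integral of phi_{s1}(x1 - x3) * phi_{s2}(x3 - x2) over x3
#        equals phi_{sqrt(s1^2 + s2^2)}(x1 - x2).
x1, x2, s1, s2 = 0.8, -0.3, 1.2, 0.7
lhs, _ = quad(lambda x3: norm.pdf(x1 - x3, scale=s1) * norm.pdf(x3 - x2, scale=s2),
              -np.inf, np.inf)
rhs = norm.pdf(x1 - x2, scale=np.hypot(s1, s2))
print(lhs, rhs)  # agree to quadrature accuracy
```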
1.2 Proof of Theorem 2.2
Proof
Let H(.) be the c.d.f. of Z and G(.) the c.d.f. of X. Then,
\(H(x)=p\cdot G(x) + (1-p)\int _{-\infty }^{\infty }{G(x-y)\,\phi _\sigma (y)\,{\mathrm{d}}y}.\)
If \({\tilde{G}}(x)=p\cdot G(x)\), then we have the equation
\({\tilde{G}}(x)=H(x)+\lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-t)\,{\tilde{G}}(t)\,{\mathrm{d}}t}, \quad {\text{where}}\; \lambda =-\frac{1-p}{p}. \qquad (4)\)
To find a solution to Eq. (4), note that this is a Fredholm equation of the second kind, and a solution to such a problem for \(|\lambda |<1\) is given by the Liouville–Neumann series
\({\tilde{G}}(x)=\sum _{t=0}^{\infty }{\lambda ^t u_t(x)},\)
where \(u_0(x)=H(x)\) and, for \(t \ge 1\),
\(u_t(x)=\int _{-\infty }^{\infty } \ldots \int _{-\infty }^{\infty } {\phi _\sigma (x-x_1)\cdot \phi _\sigma (x_1-x_2)\cdots \phi _\sigma (x_{t-1}-x_t)\,H(x_t)\,{\mathrm{d}}x_1\, {\mathrm{d}}x_2 \ldots {\mathrm{d}}x_t}.\)
Now, H(x) is unknown, so we use \({\hat{H}}(x)=\frac{1}{n}\sum _{j=1}^n{{\mathbb {I}}_{[Z_j \le x]}}\) instead. Here, \({\mathbb {I}}_A\) is the indicator function for event A.
For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable since \(0 \le {\hat{H}}(x_t) \le 1\) and each \(\phi _\sigma \) integrates to one, so the multiple integral is bounded above by 1. Also, applying Lemma 6.1 repeatedly,
\({\hat{u}}_t(x)=\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\,{\hat{H}}(x_t)\,{\mathrm{d}}x_t}=\frac{1}{n}\sum _{j=1}^n{\Phi _{\sigma \sqrt{t}}(x-Z_j)},\)
where \(\Phi _{\sigma \sqrt{t}}(.)\) denotes the c.d.f. of \(N(0,t\sigma ^2)\).
Thus, \(\hat{{\tilde{G}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma \sqrt{t}}(x-Z_j)}}\) is an estimator of \({\tilde{G}}(x)\), with the convention \(\Phi _{0}(x-Z_j)={\mathbb {I}}_{[Z_j \le x]}\) for the \(t=0\) term.
Since \(E[{\hat{H}}(x)]=H(x)\) for every x, taking expectations term by term gives \(E[\hat{{\tilde{G}}}(x)]={\tilde{G}}(x)=p\cdot G(x)\). Thus, \(\frac{\hat{{\tilde{G}}}(x)}{p}\) is an unbiased estimator for G(x). \(\square \)
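A minimal Python sketch of this step-function estimator, truncating the series at T terms and assuming \(\lambda =-\frac{1-p}{p}\) as reconstructed above (so that \(\sum _{t=0}^\infty {\lambda ^t}=p\), which requires \(p>1/2\) for \(|\lambda |<1\)); the function name is ours:

```python
import numpy as np
from scipy.stats import norm

def cdf_estimator_step(x, z, p, sigma, T=50):
    """Truncated Liouville-Neumann estimate of G(x) (Theorem 2.2 sketch)."""
    lam = -(1 - p) / p
    z = np.asarray(z, dtype=float)
    total = np.mean(z <= x)                   # t = 0 term: Phi_0 is a point mass
    for t in range(1, T + 1):
        total += lam**t * np.mean(norm.cdf(x - z, scale=sigma * np.sqrt(t)))
    return total / p

# Hypothetical usage, with conditional_mask from the sketch after the abstract:
# z = conditional_mask(np.random.default_rng(0).exponential(size=2000), p=0.8, sigma=0.5)
# print(cdf_estimator_step(1.0, z, p=0.8, sigma=0.5))
```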
1.3 Proof of Theorem 2.3
Proof
Note that in Eq. (4) the integral term is a convolution of two functions, so differentiation can be passed inside the integral. Thus, taking derivatives on both sides, we have
\({\tilde{g}}(x)=h(x)+\lambda \int _{-\infty }^{\infty }{\phi _\sigma (x-t)\,{\tilde{g}}(t)\,{\mathrm{d}}t}, \qquad (6)\)
where \({\tilde{g}}(x)=\frac{{\mathrm{d}}}{{\mathrm{d}}x}\{{\tilde{G}}(x)\}=p\,g(x)\), and g(x) and h(x) are the densities of X and Z, respectively.
To find a solution to Eq. (6), note that this is again a Fredholm equation of the second kind, and a solution to such a problem for \(|\lambda |<1\) is given by the Liouville–Neumann series
\({\tilde{g}}(x)=\sum _{t=0}^{\infty }{\lambda ^t u_t(x)},\)
where \(u_0(x)=h(x)\) and \(u_t(x)\) is the t-fold convolution of \(\phi _\sigma \) with h.
Now, h(x) is unknown, so we use \({\hat{h}}(x)=\frac{1}{nb}\sum _{j=1}^{n}{K(\frac{x-Z_j}{b})}\) instead. Here, K(.) is the Gaussian kernel function and b is the bandwidth selected by Silverman's rule of thumb.
For \(t \in {\mathbb {N}}\), the integrand in \({\hat{u}}_t(x)\) is integrable for the same reason as in the proof of Theorem 2.2. Also, since \({\hat{h}}\) is a mixture of \(N(Z_j,b^2)\) densities, Lemma 6.1 gives
\({\hat{u}}_t(x)=\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-u)\,{\hat{h}}(u)\,{\mathrm{d}}u}=\frac{1}{n}\sum _{j=1}^n{\phi _{\sqrt{t\sigma ^2+b^2}}(x-Z_j)}.\)
Thus, \(\hat{{\tilde{g}}}(x)= \sum _{t=0}^\infty {\lambda ^t {\hat{u}}_t(x)}= \frac{1}{n}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}}\) is an estimator of \({\tilde{g}}(x),\) and hence,
\({\hat{G}}(x)=\frac{1}{p}\int _{-\infty }^{x}{\hat{{\tilde{g}}}(u)\,{\mathrm{d}}u}= \frac{1}{np}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sqrt{t\sigma ^2 + b^2}}(x-Z_j)}}\)
is an estimator for G(x).
Note that this estimator is a linear combination of normal c.d.f.s, each with positive variance and hence smooth, which makes \({\hat{G}}(x)\) a smooth function. \(\square \)
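A corresponding sketch of the smooth estimator of Theorem 2.3, with the same reconstructed \(\lambda \) and one common form of Silverman's rule of thumb for the bandwidth b (the paper's exact bandwidth choice is not reproduced here):

```python
import numpy as np
from scipy.stats import norm

def cdf_estimator_smooth(x, z, p, sigma, T=50):
    """Smooth estimate of G(x): each series term uses scale sqrt(t*sigma^2 + b^2)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    q75, q25 = np.percentile(z, [75, 25])
    b = 0.9 * min(z.std(ddof=1), (q75 - q25) / 1.34) * n ** (-1 / 5)  # Silverman
    lam = -(1 - p) / p
    total = 0.0
    for t in range(T + 1):                    # t = 0 term now has positive scale b
        total += lam**t * np.mean(norm.cdf(x - z, scale=np.sqrt(t * sigma**2 + b**2)))
    return total / p
```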
1.4 Proof of Theorem 2.4
Proof
We have \(T_i(x)=\frac{1}{np}\sum _{j=1}^n {\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)}}\), where \(\sigma _t=\sigma \sqrt{t}\) for \(T_1(x)\) and \(\sigma _t=\sqrt{t\sigma ^2 + b^2}\) for \(T_b(x)\). Since \(\Phi _{\sigma _t}(x-Z_j) \le 1\) \(\forall t \in {\mathbb {N}}\) and \(\sum _{t=0}^\infty {\lambda ^t}=p\), we have \(T_{ij} := \frac{1}{p}\sum _{t=0}^\infty {\lambda ^t \Phi _{\sigma _t}(x-Z_j)} \le 1\).
\(T_i(x)=\frac{1}{n}\sum _{j=1}^n{T_{ij}}\), i.e., an average of n i.i.d. random variables, each of whose values is at most 1. Thus,
\({\text{Var}}(T_i(x))=\frac{{\text{Var}}(T_{ij})}{n} \le \frac{1}{n} \rightarrow 0 \;{\text{as}}\; n \rightarrow \infty .\)
Since \(T_1(x)\) is unbiased, the last equation implies convergence in probability of \(T_1(x)\) to its expected value G(x). For the smooth estimator,
\(E[T_b(x)]=\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\Phi _{\sqrt{t\sigma ^2+b^2}}(x-u)\,h(u)\,{\mathrm{d}}u}}.\)
The true c.d.f. can be written as (using Eq. (4))
\(G(x)=\frac{{\tilde{G}}(x)}{p}=\frac{1}{p}\sum _{t=0}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)\,h(x_t)\,{\mathrm{d}}x_t}},\)
since \(\int _{-\infty }^{\infty }{\phi _{\sigma \sqrt{t}}(x-x_t)\cdot H(x_t)\,{\mathrm{d}}x_t}=\int _{-\infty }^{\infty }{\Phi _{\sigma \sqrt{t}}(x-x_t)\cdot h(x_t)\,{\mathrm{d}}x_t}\), both being the c.d.f. of \(Z+N(0,t\sigma ^2)\) evaluated at x.
Thus, we have an expression for the bias at point x, denoted by B(x):
\(B(x)=E[T_b(x)]-G(x)=\underbrace{\frac{1}{p}\bigl (H^{\star }(x)-H(x)\bigr )}_{\textsf {Tr}_1}+\underbrace{\frac{1}{p}\sum _{t=1}^\infty {\lambda ^t \int _{-\infty }^{\infty }{\bigl (\Phi _{\sqrt{t\sigma ^2+b^2}}(x-u)-\Phi _{\sigma \sqrt{t}}(x-u)\bigr )\,h(u)\,{\mathrm{d}}u}}}_{\textsf {Tr}_2},\)
where \(H^{\star }(x)\) is the c.d.f. of \(Z+N(0,b^2)\) and Z is a r.v. with p.d.f. h(.). As \(n \rightarrow \infty \), \(b \rightarrow 0\) and \(N(0,b^2) \overset{P}{\rightarrow } 0\). Thus, by Slutsky's theorem, \(Z+N(0,b^2) \Rightarrow Z\) in distribution as \(b \rightarrow 0\); hence, for \(\textsf {Tr}_1\), \(|H(x)-H^{\star }(x)| \rightarrow 0\) as \(n \rightarrow \infty \).
For \(\textsf {Tr}_2\), since \(\Phi (\frac{x-z}{\sqrt{y}})\) is continuously differentiable in y, expanding the function by a Taylor series at \(y=t\sigma ^2\) gives \(|{\Phi (\frac{x-z}{\sqrt{t\sigma ^2 +b^2}})}-\Phi (\frac{x-z}{\sqrt{t\sigma ^2}})|=|(t\sigma ^2+b^2 - t\sigma ^2)\cdot \frac{1}{2}\,\phi (\frac{x-z}{\sqrt{y^*}})\frac{x-z}{(y^*)^{3/2}}|=\frac{b^2}{2y^*}\,|\phi (\frac{x-z}{\sqrt{y^*}})\frac{x-z}{(y^*)^{1/2}}|\), where \(y^*\) is a point between \(t\sigma ^2\) and \(t\sigma ^2 +b^2\). Also, it is easy to show that \(|x\phi (x)| \le 1\) for all \(x \in {\mathbb {R}}\) (as discussed in Note 5.1), which implies \(|\phi (\frac{x-z}{\sqrt{y^*}})\frac{x-z}{(y^*)^{1/2}}| \le 1\). Thus,
\(|\textsf {Tr}_2| \le \frac{1}{p}\sum _{t=1}^\infty {|\lambda |^t \frac{b^2}{2t\sigma ^2}} \rightarrow 0 \;{\text{as}}\; n \rightarrow \infty ,\)
since \(b \rightarrow 0\).
Thus, \({\text{MSE}}(T_b(x))={\text{Var}}(T_b(x))+B(x)^2 \rightarrow 0\) as \(n \rightarrow \infty \). Hence, we have \(T_b(x) \overset{{\mathbb {L}}_2}{\longrightarrow } G(x)\) \(\forall x\) as \(n \rightarrow \infty \), which implies the result. \(\square \)
Note 5.1 For \(|x| \le 1\): \(|x| \le 1\), \(e^{-x^2/2}\le 1\) and \(\frac{1}{\sqrt{2\pi }}<1\) together imply \(|x\phi (x)|<1\). For \(|x|>1\): since \(|x\phi (x)|=|x|\phi (|x|)\), it can be written as \(\frac{1}{\sqrt{2\pi }}\frac{|x|}{e^{x^2/2}}=\frac{1}{\sqrt{2\pi }}\cdot \frac{1}{\frac{1}{|x|}+\frac{|x|}{2} + \delta } \le \frac{2}{\sqrt{2\pi }}<1\), where \(\delta > 0\) collects the higher-order terms of the expansion of \(e^{x^2/2}/|x|\) and the last inequality uses \(\frac{|x|}{2}>\frac{1}{2}\).
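As a quick numerical check of Note 5.1 (our own illustration): \(\sup _x|x\phi (x)|\) is attained at \(|x|=1\) and equals \(\phi (1) \approx 0.2420 < 1\).

```python
import numpy as np
from scipy.stats import norm

# The maximum of |x * phi(x)| over a fine grid matches phi(1) ~ 0.2420.
xs = np.linspace(-10, 10, 100001)
print(np.abs(xs * norm.pdf(xs)).max(), norm.pdf(1.0))
```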
1.5 Proof of Theorem 2.5
Proof
For \(1 \le i \le n\), we have, by the Cauchy–Schwarz inequality,
\(E|Z_i\cdot X^{\prime }_i| \le \sqrt{E[Z_i^2]\cdot E[X_i^{\prime 2}]} < \infty .\)
Thus, \({\hat{Cov}}(Z,X^{\prime })=\frac{1}{n}\cdot \sum _{j=1}^n[{Z_j\cdot X^{\prime }_j}]-{\bar{Z}}\cdot \bar{X^{\prime }}\) exists and hence is a consistent estimator of the true covariance between Z and \(X^{\prime }\). Since the swapped values are independent of \(X^{\prime }\) and the added noise Y is independent of everything else, \(Cov(Z,X^{\prime })=(1-p)\cdot Cov(X,X^{\prime })\).
Thus, \({\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \((1-p)\cdot Cov(X,X^{\prime })\), or \(\frac{1}{1-p}{\hat{Cov}}(Z,X^{\prime })\) is a consistent estimator of \(Cov(X,X^{\prime })\). Also, as seen in Theorem 2.1, Var(Z) exists if Var(X) does. Thus, \(\sqrt{{\hat{Var}}(Z)} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(Z)}\), and since \({\text{Var}}(X^{\prime })\) exists, \(\sqrt{{\hat{Var}}(X^{\prime })} \overset{P}{\longrightarrow } \sqrt{{\text{Var}}(X^{\prime })}\). Hence, by the properties of convergence in probability, the result follows. \(\square \)
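A sketch of the resulting covariance and correlation estimates from the masked column z and an unmasked column \(X^{\prime }\). The rescaling of the covariance by \(\frac{1}{1-p}\) is from Theorem 2.5; the variance correction \({\text{Var}}(X)={\text{Var}}(Z)-(1-p)\sigma ^2\) is our reconstruction from the moment identities of Theorem 2.1, and the function name is ours.

```python
import numpy as np

def cov_corr_estimates(z, x_prime, p, sigma):
    """Consistent estimates of Cov(X, X') and Corr(X, X') (Theorem 2.5 sketch)."""
    z = np.asarray(z, dtype=float)
    xp = np.asarray(x_prime, dtype=float)
    cov_zx = np.mean(z * xp) - z.mean() * xp.mean()
    cov_x = cov_zx / (1 - p)                   # Cov(Z, X') = (1-p) Cov(X, X')
    var_x = z.var() - (1 - p) * sigma**2       # assumed variance correction
    corr_x = cov_x / np.sqrt(var_x * xp.var())
    return cov_x, corr_x
```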
Keywords
- Data obfuscation
- Quantile estimation
- Privacy protection
- Masking numerical datasets