
PrivPfC: differentially private data publication for classification

  • Regular Paper
  • The VLDB Journal

Abstract

In this paper, we tackle the problem of constructing a differentially private synopsis for classification analysis. Several state-of-the-art methods follow the structure of existing classification algorithms and are all iterative, which is suboptimal because of locally optimal choices and the division of the privacy budget among many sequentially composed steps. We propose PrivPfC, a new differentially private method for releasing data for classification. The key idea underlying PrivPfC is to privately select, in a single step, a grid that partitions the data domain into a number of cells. This selection is done using the exponential mechanism with a novel quality function, which maximizes the expected number of records correctly classified by a histogram classifier. PrivPfC supports both binary and multiclass classification. Through extensive experiments on real datasets, we demonstrate PrivPfC's superiority over the state-of-the-art methods.
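As a rough illustration of the selection step described above (our own minimal sketch, not the authors' implementation), the snippet below applies the exponential mechanism to a handful of candidate grids. Here grid_quality is a simplified stand-in for the paper's quality function (the expected number of records a histogram classifier would classify correctly), and the names grid_quality and select_grid are ours.

```python
import numpy as np

def grid_quality(cells):
    """Toy stand-in for the paper's quality function: count, over all cells,
    the records belonging to each cell's majority class, i.e. the records a
    histogram classifier built on this grid would classify correctly."""
    return sum(max(counts.values()) for counts in cells if counts)

def select_grid(candidate_grids, epsilon, sensitivity=1.1):
    """Exponential mechanism: sample one grid with probability proportional
    to exp(epsilon * quality / (2 * sensitivity))."""
    scores = np.array([grid_quality(g) for g in candidate_grids], dtype=float)
    logits = epsilon * (scores - scores.max()) / (2.0 * sensitivity)  # stabilize
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(np.random.choice(len(candidate_grids), p=probs))

# Each candidate grid is a list of cells; each cell maps class label -> count.
g_fine = [{0: 8, 1: 2}, {0: 1, 1: 9}]    # separates the classes well
g_poor = [{0: 5, 1: 5}, {0: 4, 1: 6}]    # separates the classes poorly
print(select_grid([g_fine, g_poor], epsilon=1.0))  # usually prints 0
```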


References

  1. Asuncion, A., Newman, D.: UCI machine learning repository (2010)

  2. Bayardo, R.J., Agrawal, R.: Data privacy through optimal \(k\)-anonymization. In: ICDE, pp. 217–228 (2005)

  3. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Secaucus (2006)


  4. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: PODS, pp. 128–138 (2005)

  5. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)


  6. Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: NIPS, pp. 289–296 (2008)

  7. Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. J. Mach. Learn. Res. 12, 1069–1109 (2011)


  8. Cormode, G., Srivastava, D., Li, N., Li, T.: Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data. PVLDB 3(1–2), 1045–1056 (2010)


  9. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics. Springer, Berlin (1996)


  10. Dwork, C.: Differential privacy. In: ICALP, pp. 1–12 (2006)

  11. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC, pp. 265–284 (2006)

  12. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: CCS, pp. 1322–1333 (2015)

  13. Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: USENIX Security Symposium, pp. 17–32 (2014)

  14. Friedman, A., Schuster, A.: Data mining with differential privacy. In: KDD, pp. 493–502 (2010)

  15. Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: ICDE, pp. 205–216 (2005)

  16. Geng, X., Liu, T.Y., Qin, T., Li, H.: Feature selection for ranking. In: SIGIR, pp. 407–414 (2007)

  17. Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D.: Principled evaluation of differentially private algorithms using DPBench. In: SIGMOD, pp. 139–154 (2016)

  18. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: KDD, pp. 279–288 (2002)

  19. Jagannathan, G., Pillaipakkamnatt, K., Wright, R.N.: A practical differentially private random decision tree classifier. Trans. Data Priv. 5, 273–295 (2012)


  20. Kotz, S., Kozubowski, T., Podgorski, K.: The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer, Berlin (2001)


  21. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Incognito: efficient full-domain \(k\)-anonymity. In: SIGMOD, pp. 49–60 (2005)

  22. LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian multidimensional \(k\)-anonymity. In: ICDE, p. 25 (2006)

  23. Lei, J.: Differentially private m-estimators. In: NIPS, pp. 361–369 (2011)

  24. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS, pp. 94–103 (2007)

  25. Minnesota Population Center: Integrated Public Use Microdata Series, International: Version 6.5 [dataset]. University of Minnesota, Minneapolis (2017). https://doi.org/10.18128/D020.V6.5

  26. Mohammed, N., Chen, R., Fung, B.C.M., Yu, P.S.: Differentially private data release for data mining. In: KDD, pp. 493–501 (2011)

  27. Qardaji, W., Yang, W., Li, N.: Differentially private grids for geospatial data. In: ICDE, pp. 757–768 (2013)

  28. Qardaji, W., Yang, W., Li, N.: Understanding hierarchical methods for differentially private histograms. PVLDB 6(14), 1954–1965 (2013)


  29. Ruggles, S., Genadek, K., Goeken, R., Grover, J., Sobek, M.: Integrated Public Use Microdata Series: Version 7.0 [dataset]. University of Minnesota, Minneapolis (2017). https://doi.org/10.18128/D010.V7.0

  30. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22–26, 2017, pp. 3–18 (2017)

  31. Su, D., Cao, J., Li, N., Bertino, E., Jin, H.: Differentially private k-means clustering. In: Proceedings of the Sixth ACM on Conference on Data and Application Security and Privacy, CODASPY 2016, New Orleans, LA, USA, March 9–11, 2016, pp. 26–37 (2016)

  32. Su, D., Cao, J., Li, N., Bertino, E., Lyu, M., Jin, H.: Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans. Priv. Secur. 20(4), 16:1–16:33 (2017)


  33. Therneau, T.M., Atkinson, B.: Package: rpart (2014). http://cran.r-project.org/web/packages/rpart/rpart.pdf

  34. Vinterbo, S.A.: Differentially private projected histograms: construction and use for prediction. In: ECML PKDD’12, pp. 19–34 (2012)

  35. Wong, R.C.W., Fu, A.W.C., Wang, K., Pei, J.: Minimality attack in privacy preserving data publishing. In: VLDB, pp. 543–554 (2007)

  36. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML, pp. 412–420 (1997)

  37. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. In: SIGMOD ’14, pp. 1423–1434 (2014)

  38. Zhang, J., Xiao, X., Yang, Y., Zhang, Z., Winslett, M.: PrivGene: differentially private model fitting using genetic algorithms. In: SIGMOD ’13, pp. 665–676 (2013)

  39. Zhang, J., Zhang, Z., Xiao, X., Yang, Y., Winslett, M.: Functional mechanism: regression analysis under differential privacy. PVLDB 5(11), 1364–1375 (2012)



Acknowledgements

We thank the reviewers for their valuable comments. This paper is based upon work supported by the United States National Science Foundation under grants CNS-1116991 and CNS-1640374, by the Key Laboratory on High Performance Computing, Anhui Province, by NSFC grants 61672486, 61672480, and 11671376, and by the Key Program of NSFC grant 71631006.

Author information


Corresponding author

Correspondence to Dong Su.

Appendix

Lemma 3 gives the distribution of the difference of two i.i.d. Laplace random variables, which will be used in the proof of Lemma 1.

Lemma 3

[20] Let \(Z_1\) and \(Z_2\) be two i.i.d. random variables that follow the Laplace distribution with mean 0 and scale \(\frac{1}{\epsilon }\). Then the density of their difference \(Y = Z_1 - Z_2\) is

$$\begin{aligned} f_{Y}(y) = \frac{\epsilon }{4} e^{-\epsilon |y|}(1 + \epsilon |y|)\quad \quad -\infty< y < \infty , \end{aligned}$$

and the corresponding cumulative distribution function is

$$\begin{aligned} F_{Y}(y) = {\left\{ \begin{array}{ll} \frac{e^{\epsilon y}}{2}\left( 1-\frac{\epsilon y}{2}\right) &{}\quad y < 0,\\ 1-\frac{e^{-\epsilon y}}{2}\left( 1+\frac{\epsilon y}{2}\right) &{}\quad y \ge 0. \end{array}\right. } \end{aligned}$$
(8)
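As a quick sanity check of Lemma 3 (our own sketch, not part of the paper), the snippet below compares the closed-form CDF above with the empirical CDF of a large sample of differences of two i.i.d. Laplace variables; laplace_diff_cdf is our name.

```python
import numpy as np

def laplace_diff_cdf(y, eps):
    """Closed-form CDF of Y = Z1 - Z2, with Z1, Z2 i.i.d. Laplace(0, 1/eps)."""
    y = np.asarray(y, dtype=float)
    pos = 1.0 - 0.5 * np.exp(-eps * y) * (1.0 + eps * y / 2.0)
    neg = 0.5 * np.exp(eps * y) * (1.0 - eps * y / 2.0)
    return np.where(y >= 0, pos, neg)

eps = 0.8
rng = np.random.default_rng(0)
samples = rng.laplace(0.0, 1.0 / eps, 1_000_000) - rng.laplace(0.0, 1.0 / eps, 1_000_000)
for y in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    # The analytical and empirical values should agree to roughly 3 decimal places.
    print(y, float(laplace_diff_cdf(y, eps)), float((samples <= y).mean()))
```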

Proof of Lemma 1

Proof

We show that the global sensitivity of the grid quality (Eq. 1) in the binary classification setting can be safely bounded by 1.1.

Given a dataset D, without loss of generality we assume that D has 2 class labels \(\{1,2\}\) and that the neighboring dataset of D is \(D^{\prime } = D - t\), where the tuple t falls into cell e and has class label 1. Given a grid g, the quality values of all its cells are the same for D and \(D'\), except for the cell e. In the cell e, let \(n_{e}^{1}\) and \({n'}_{e}^{1}\) denote the number of data points with label 1, and \(n_{e}^{2}\) and \({n'}_{e}^{2}\) the number with label 2, for D and \(D^{\prime }\), respectively, so that \(n_{e}^{1}={n'}_{e}^{1}+1\) and \(n_{e}^{2}={n'}_{e}^{2}\). Each class also has a probability \(p_{e}^{i}\), \(i \in \{1,2\}\), of being the dominant class in cell e.

To compute the global sensitivity of the grid quality (Eq. 1), we first present the difference of the grid quality function for the neighboring datasets D and \(D'\) in terms of \(n_e^1, n_e^2, p_e^1\) and \(p_e^2\). Then we show that the difference can be bounded by 1.1.

The global sensitivity of the grid quality function for binary classification can be computed by,

$$\begin{aligned} \varDelta _{\textsf {gq}}&= |\textsf {gq} (D, g)-\textsf {gq} (D', g)| \nonumber \\&= \left| \left( n_{e}^{1} p_{e}^{1} + n_{e}^{2} p_{e}^{2}\right) -\left( n_{e}^{'1} p_{e}^{'1} + n_{e}^{'2} p_{e}^{'2}\right) \right| \nonumber \\&= \left| \left( n_{e}^{1} p_{e}^{1} + n_{e}^{2} p_{e}^{2}\right) -\left( \left( n_{e}^{1}-1\right) p_{e}^{'1} + n_{e}^{2} p_{e}^{'2}\right) \right| \nonumber \\&\quad \text{ since } {n'}_{e}^{1}=n_{e}^{1}-1, {n'}_{e}^{2}=n_{e}^{2} \nonumber \\&= \left| p_e^1+\left( n_e^{1}-1\right) \left( p_e^1-p_e^{'1}\right) +n_e^{2}\left( p_e^2-p_e^{'2}\right) \right| \nonumber \\&= \left| p_e^1+\left( n_e^{1}-1\right) \left( p_e^1-p_e^{'1}\right) +n_e^{2}\left( \left( 1-p_{e}^{1}\right) -\left( 1-p_e^{'1}\right) \right) \right| \nonumber \\&= \left| p_e^1+\left( n_e^{1}-n_e^2-1\right) \left( p_e^1-p_e^{'1}\right) \right| , \end{aligned}$$
(9)

where \(p_e^1\) is the probability that Class 1 is still the dominant class after adding noise. The last equality holds because \(p_e^2=1-p_e^1\) and \(p_{e}^{'2}=1-p_{e}^{'1}\).

As for \(p_e^1\), by Lemma 3 it equals \(F_Y\left( n_e^1-n_e^2\right) \), that is,

$$\begin{aligned} p_e^1 = {\left\{ \begin{array}{ll} 1-\frac{e^{-\epsilon (n_e^1-n_e^2)}}{2}\left( 1+\frac{\epsilon (n_e^1-n_e^2)}{2}\right) &{}\quad n_e^1-n_e^2 \ge 0,\\ \frac{e^{\epsilon (n_e^1-n_e^2)}}{2}\left( 1-\frac{\epsilon (n_e^1-n_e^2)}{2}\right) &{}\quad n_e^1-n_e^2 < 0. \end{array}\right. } \end{aligned}$$
(10)

Since the probability \(p_e^1\) (Eq. 10) takes different forms depending on whether \(n_e^1-n_e^2 \ge 0\), we analyze the global sensitivity \(\varDelta _{\textsf {gq}}\) in two cases: either Class 1 is the dominant class in cell e for D (\(n_e^1-n_e^2 \ge 1\)), or it is not (\(n_e^1-n_e^2 \le 0\)). We show that the upper bound of the global sensitivity is 1.1 for \(n_e^1-n_e^2 \ge 1\) and 0.5 for \(n_e^1-n_e^2 \le 0\). Note that these two cases cover all possibilities, since \(n_e^1\) and \(n_e^{2}\) are integers. Therefore, the global sensitivity can be bounded by 1.1.

Case 1: \({\mathbf {n_e^1-n_e^2 \ge 1}}\).

In this case, \(n_e^{'1}-n_e^{'2} = n_e^1-1-n_e^2 \ge 0\). By Eqs. (9) and (10), we have

$$\begin{aligned} \varDelta _{\textsf {gq}}&= \biggl |1-\frac{e^{-\epsilon (n_e^1-n_e^2)}}{2}\left( 1+\frac{\epsilon \left( n_e^1-n_e^2\right) }{2}\right) + \\&\left( n_e^{1}-n_e^2-1\right) \biggl [\frac{e^{-\epsilon (n_e^{'1}-n_e^{'2})}}{2}\left( 1+\frac{\epsilon \left( n_e^{'1}-n_e^{'2}\right) }{2}\right) \\&-\frac{e^{-\epsilon (n_e^1-n_e^2)}}{2}\left( 1+\frac{\epsilon \left( n_e^1-n_e^2\right) }{2}\right) \biggr ]\biggr |. \end{aligned}$$

By letting \(x = n_e^1-n_e^2\), we have

$$\begin{aligned} \varDelta _{\textsf {gq}}&= \biggl |1-\frac{e^{-\epsilon x}}{2}\left( 1+\frac{\epsilon x}{2}\right) + (x-1)\biggl [\frac{e^{-\epsilon (x-1)}}{2}\left( 1+\frac{\epsilon (x-1)}{2}\right) \\&\quad -\frac{e^{-\epsilon x}}{2}\left( 1+\frac{\epsilon x}{2}\right) \biggr ]\biggr | \\&= \biggl |1+\frac{(x-1)e^{-\epsilon (x-1)}}{2}\left( 1+\frac{\epsilon (x-1)}{2}\right) -\frac{xe^{-\epsilon x}}{2}\left( 1+\frac{\epsilon x}{2}\right) \biggr |, \end{aligned}$$

where \(x \ge 1\).

For simplicity, let us consider the function \(g_{1}(x)=\frac{xe^{-\epsilon x}}{2}(1+\frac{\epsilon x}{2})\), and thus the sensitivity becomes \( \varDelta _{\textsf {gq}} = \left| 1+g_{1}(x-1)-g_{1}(x)\right| \).

Note that \(g_{1}(x)\) is differentiable and its derivative \(g_{1}^{\prime }(x)=\frac{e^{-\epsilon x}}{4}(2-\epsilon ^2 x^2)\). Thus, by Lagrange’s Mean Value Theorem, for any \(x \ge 1\), there exists some \(\xi \) between \(x-1\) and x (thus \(\xi >0\)), so that

$$\begin{aligned} \varDelta _{\textsf {gq}}&= \left| 1+g_{1}(x-1)-g_{1}(x)\right| \\&= \left| 1-g_{1}^{\prime }(\xi )\right| \\&= \left| 1-\frac{e^{-\epsilon \xi }}{4}\left( 2-\epsilon ^2 \xi ^2\right) \right| . \end{aligned}$$

To bound the expression above, consider another function

$$\begin{aligned} h(x)=1-\frac{e^{-\epsilon x}}{4}\left( 2-\epsilon ^2 x^2\right) , \end{aligned}$$

where \(x>0\). The function h(x) increases on the interval \(\left[ 0, \frac{1+\sqrt{3}}{\epsilon }\right] \), decreases on \(\left( \frac{1+\sqrt{3}}{\epsilon }, \infty \right) \), and reaches its maximum at \(x = \frac{1+\sqrt{3}}{\epsilon }\), where its value is approximately 1.09 and hence at most 1.1. When \(x \in \left[ 0, \frac{1+\sqrt{3}}{\epsilon }\right] \), \(h(0) \le h(x) \le h\left( \frac{1+\sqrt{3}}{\epsilon }\right) \), which means \(h(x) \in [0.5, 1.1]\). When \(x \in \left( \frac{1+\sqrt{3}}{\epsilon }, \infty \right) \), h(x) decreases toward \(\lim \nolimits _{x\rightarrow +\infty } h(x)=1\) and thus lies in (1, 1.1]. Therefore, in this case,

$$\begin{aligned} \varDelta _{\textsf {gq}} = |h(\xi )| \in [0.5, 1.1]. \end{aligned}$$
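The claims about h(x) above are easy to verify numerically. The following sketch (ours, not from the paper) checks that h(0) = 0.5, that the maximum over x > 0 sits at (1 + sqrt(3))/epsilon with a value of roughly 1.09 (hence below 1.1), and that h(x) tends to 1 for large x.

```python
import numpy as np

def h(x, eps):
    return 1.0 - np.exp(-eps * x) / 4.0 * (2.0 - (eps * x) ** 2)

for eps in [0.1, 0.5, 1.0, 2.0]:
    x_star = (1.0 + np.sqrt(3.0)) / eps
    xs = np.linspace(0.0, 50.0 * x_star, 1_000_000)
    # h(0) = 0.5, grid maximizer ~= x_star, h(x_star) ~= 1.089, h(large x) -> 1
    print(eps, h(0.0, eps), xs[np.argmax(h(xs, eps))], h(x_star, eps), h(xs[-1], eps))
```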

Case 2: \({\mathbf {n_e^1-n_e^2 \le 0}}\)

In this case,

$$\begin{aligned} n_e^{'1}-n_e^{'2} = n_e^1-1-n_e^2 < 0. \end{aligned}$$

Similarly, by letting \(x = n_e^1-n_e^2\), we have

$$\begin{aligned} \varDelta _{\textsf {gq}}&= \biggl |\frac{e^{\epsilon (n_e^1-n_e^2)}}{2}\left( 1-\frac{\epsilon \left( n_e^1-n_e^2\right) }{2}\right) + \\&\quad \left( n_e^{1}-n_e^2-1\right) \biggl [\frac{e^{\epsilon (n_e^1-n_e^2)}}{2}\left( 1-\frac{\epsilon \left( n_e^1-n_e^2\right) }{2}\right) - \\&\quad \frac{e^{\epsilon (n_e^1-n_e^2-1)}}{2}\left( 1-\frac{\epsilon \left( n_e^1-n_e^2-1\right) }{2}\right) \biggr ]\biggr | \\&= \biggl |\frac{e^{\epsilon x}}{2}\left( 1-\frac{\epsilon x }{2}\right) + (x-1) \\&\quad \left[ \frac{e^{\epsilon x }}{2}\left( 1-\frac{\epsilon x }{2}\right) -\frac{e^{\epsilon (x-1)}}{2}\left( 1-\frac{\epsilon (x-1)}{2}\right) \right] \biggr | \\&= \left| \frac{xe^{\epsilon x }}{2}\left( 1-\frac{\epsilon x }{2}\right) -\frac{(x-1)e^{\epsilon (x-1)}}{2}\left( 1-\frac{\epsilon (x-1)}{2}\right) \right| , \end{aligned}$$

where \(x \le 0\).

Similarly to Case 1, let \(g_2(x)=\frac{xe^{\epsilon x }}{2}\left( 1-\frac{\epsilon x }{2}\right) \), and then the sensitivity becomes \(\varDelta _{\textsf {gq}} = \left| g_{2}(x)-g_{2}(x-1)\right| \).

The function \(g_2(x)\) is differentiable and its derivative is \(g_{2}^{\prime }(x)=\frac{e^{\epsilon x}}{4}\left( 2-\epsilon ^2 x^2\right) \). The derivative \(g_{2}^{\prime }(x)\) decreases on \(\left( -\infty , -\frac{1+\sqrt{3}}{\epsilon }\right) \), increases on \(\left[ -\frac{1+\sqrt{3}}{\epsilon }, 0\right] \), and reaches its minimum value of approximately \(-0.09\) at \(x=-\frac{1+\sqrt{3}}{\epsilon }\). Thus, \(g_{2}^{\prime }(x) \in [-0.09, 0)\) when \(x \le -\frac{1+\sqrt{3}}{\epsilon }\), because \(\lim \limits _{x\rightarrow -\infty } g_{2}^{\prime }(x)=0\), and \(g_{2}^{\prime }(x) \in [-0.09, 0.5]\) when \(x\in \left[ -\frac{1+\sqrt{3}}{\epsilon }, 0\right] \). Applying Lagrange's Mean Value Theorem to \(g_{2}(x)\), for any \(x \le 0\), there exists some \(\eta \) between \(x-1\) and x (thus \(\eta \le 0\)) such that

$$\begin{aligned} \varDelta _{\textsf {gq}}&= \left| g_{2}(x)-g_{2}(x-1)\right| \\&= \left| g_{2}^{\prime }(\eta )\right| \\&\le 0.5. \end{aligned}$$
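Analogously, the range claims for \(g_2^{\prime }(x)\) can be checked numerically (our own sketch): its minimum over x ≤ 0 is roughly -0.089 at x = -(1 + sqrt(3))/epsilon, its maximum is g_2'(0) = 0.5, and it vanishes as x tends to minus infinity, so |g_2'(eta)| ≤ 0.5.

```python
import numpy as np

def g2_prime(x, eps):
    return np.exp(eps * x) / 4.0 * (2.0 - (eps * x) ** 2)

for eps in [0.1, 0.5, 1.0, 2.0]:
    x_min = -(1.0 + np.sqrt(3.0)) / eps
    xs = np.linspace(50.0 * x_min, 0.0, 1_000_000)
    vals = g2_prime(xs, eps)
    # minimum ~= g_2'(x_min) ~= -0.089, maximum = g_2'(0) = 0.5
    print(eps, vals.min(), g2_prime(x_min, eps), vals.max(), g2_prime(0.0, eps))
```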

In summary, by considering the above two cases, the global sensitivity for grid quality on binary classification \(\varDelta _{\textsf {gq}}\) is bounded by 1.1 and reaches its maximum when Case 1 holds, at

$$\begin{aligned} x^{*} = \frac{\epsilon e^{\epsilon } + \sqrt{2 - \left( 4 - 2e^{\epsilon }\right) e^{\epsilon } + \epsilon ^{2}e^{\epsilon } }}{-\epsilon + \epsilon e^{\epsilon }}. \end{aligned}$$

The global maximum point \(x^{*}\) is obtained by taking the derivative of \(1+g_{1}(x-1)-g_{1}(x)\) and setting it to zero. This completes the proof. \(\square \)
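The two cases can also be checked end to end by brute force (our own sketch, with g1 and g2 as defined in the proof): over integer differences x and several privacy budgets, |1 + g_1(x-1) - g_1(x)| stays below 1.1 for x ≥ 1, and |g_2(x) - g_2(x-1)| stays at or below 0.5 for x ≤ 0.

```python
import numpy as np

def g1(x, eps):
    return x * np.exp(-eps * x) / 2.0 * (1.0 + eps * x / 2.0)

def g2(x, eps):
    return x * np.exp(eps * x) / 2.0 * (1.0 - eps * x / 2.0)

for eps in [0.05, 0.1, 0.5, 1.0, 2.0]:
    x_pos = np.arange(1, 10_001, dtype=float)    # Case 1: x = n_e^1 - n_e^2 >= 1
    case1 = np.abs(1.0 + g1(x_pos - 1.0, eps) - g1(x_pos, eps))
    x_neg = -np.arange(0, 10_001, dtype=float)   # Case 2: x <= 0
    case2 = np.abs(g2(x_neg, eps) - g2(x_neg - 1.0, eps))
    print(eps, case1.max(), case2.max())         # first value < 1.1, second <= 0.5
```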

Proof of Lemma 2

Proof

Given a dataset D, without loss of generality we assume that the neighboring dataset of D is \(D^{\prime } = D - t\), where the tuple t is in cell e. The quality values of all cells other than cell e are the same. Denote \(l_t\) as the class label of t.

Since the approximation of the grid quality (Eq. 5) involves only the top two classes, and the quality values of all cells other than cell e are the same, whether the difference \(\varDelta _{\textsf {gq}}=|\textsf {gq} (D, g)-\textsf {gq} (D', g)|\) is 0 depends on whether the count of Class \(l_t\) is among the top two class counts of cell e in D and \(D'\). It is worth pointing out that the rank of Class \(l_t\) cannot rise in \(D'= D-t\), because deleting the tuple t can only decrease the count of Class \(l_t\). Therefore, we bound \(\varDelta _{\textsf {gq}}\) by considering three separate cases: (1) Class \(l_t\) is not in the top two classes in D or \(D'\), (2) Class \(l_t\) is in the top two classes in both D and \(D'\), and (3) Class \(l_t\) is in the top two classes in D but not in \(D'\). For convenience, we write \(l_t \in \{(1), (2)\}\) to denote that the count of Class \(l_t\) is among the top two, and \(l_t \notin \{(1), (2)\}\) to denote that it is not.

Case 1: \(l_t \notin \{(1),(2)\}\) in both D and \(D'\). Class \(l_t\) does not rank in the top two in either D or \(D'\). In this case, deleting the tuple t does not affect the top two classes: \(n_e^{(1)}=n_e^{'(1)}\), \(n_e^{(2)}=n_e^{'(2)}\), \(p_e^{(1)}=p_e^{'(1)}\) and \(p_e^{(2)}=p_e^{'(2)}\). So \(\textsf {gq} (D,g)=\textsf {gq} (D',g)\), which means \(\varDelta _{\textsf {gq}}=0\).

Case 2: \(l_t \in \{(1),(2)\}\) in both D and \(D'\). Class \(l_t\) ranks in top-2 in both D and \(D'\). In this case, the rank of \(l_t\) may change. So let us consider the following subcases.

Subcase 2.1: Class \(l_t\) ranks first (resp. second) in both D and \(D'\). Then, we have \(n_e^{(1)}=n_e^{'(1)}+1\) and \(n_e^{(2)}=n_e^{'(2)}\) (resp. \(n_e^{(1)}=n_e^{'(1)}\) and \(n_e^{(2)}=n_e^{'(2)}+1\) ). Similar to the proof of Lemma 1, we obtain \(\varDelta _{\textsf {gq}} \le 1.1\).

Subcase 2.2: Class \(l_t\) ranks first in D and second in \(D'\). Deleting one tuple makes the class with the highest count become the second highest class, which occurs only when there is a tie between the two highest classes in D. (This tie can be resolved by the alphabetical order of class labels.) In this case, we have

$$\begin{aligned} n_e^{(1)}=n_e^{(2)}, n_e^{'(1)}=n_e^{(2)}, n_e^{'(2)}=n_e^{(1)}-1. \end{aligned}$$
(11)

Thus,

$$\begin{aligned} \varDelta _{\textsf {gq}}&= |\textsf {gq} (D, g)-\textsf {gq} (D', g)| \\&= \left| \left( n_{e}^{(1)} p_{e}^{(1)} + n_{e}^{(2)} p_{e}^{(2)}\right) -\left( n_{e}^{'(1)} p_{e}^{'(1)} + n_{e}^{'(2)} p_{e}^{'(2)}\right) \right| \\&= \left| n_{e}^{(1)}\left( p_{e}^{(1)}+p_{e}^{(2)}\right) - \left( n_{e}^{(1)}p_{e}^{'(1)}+\left( n_{e}^{(1)}-1\right) p_{e}^{'(2)}\right) \right| \\&= \left| n_{e}^{(1)}-\left( n_{e}^{(1)}-p_{e}^{'(2)}\right) \right| \\&= \left| p_{e}^{'(2)}\right| \le 1, \end{aligned}$$

where the third equality holds because of Eq. (11) and the fourth equality holds because \(p_{e}^{(2)}=1-p_{e}^{(1)}\) and \(p_{e}^{'(2)}=1-p_{e}^{'(1)}\).

Case 3: \(l_t \in \{(1),(2)\}\) in D and \(l_t \notin \{(1),(2)\}\) in \(D'\). Class \(l_t\) ranks in the top two in D, but not in \(D'\). Similar to Subcase 2.2, this case occurs only when there is a tie between Class \(l_t\) and \(s (\ge 1)\) other classes that are not ranked in the top two. Deleting the tuple t pushes \(l_t\) out of the top two, so, by our tie-resolving rule, one of the s classes with the same count as Class \(l_t\) moves into the top two. For example, suppose the number of classes is \(k =3\) and \(n_{e}^{1} = n_{e}^{2} = n_{e}^{3} = 10\), so the two highest counts are \(n_{e}^{(1)}=n_{e}^{(2)}=10\). After removing the tuple t with label 1, we have \(n_{e}^{\prime 2} = n_{e}^{\prime 3} = 10\) and \(n_{e}^{\prime 1} = 9\). The class containing t is ranked out of the top two and the class with label 3 moves into the top two, so the counts of the top two classes in \(D'\) remain the same, i.e., \(n_{e}^{'(1)}=n_{e}^{'(2)}=10\). That is to say,

$$\begin{aligned} n_e^{(1)}=n_e^{(2)}, n_e^{'(1)}=n_e^{(1)}, n_e^{'(2)}=n_e^{(1)}. \end{aligned}$$
(12)

Similarly, we have

$$\begin{aligned} \varDelta _{\textsf {gq}}&= |\textsf {gq} (D, g)-\textsf {gq} (D', g)| \\&= \left| \left( n_{e}^{(1)} p_{e}^{(1)} + n_{e}^{(2)} p_{e}^{(2)}\right) -\left( n_{e}^{'(1)} p_{e}^{'(1)} + n_{e}^{'(2)} p_{e}^{'(2)}\right) \right| \\&= \left| n_{e}^{(1)}\left( p_{e}^{(1)}+p_{e}^{(2)}\right) - \left( n_{e}^{(1)}p_{e}^{'(1)}+n_{e}^{(1)}p_{e}^{'(2)}\right) \right| \\&= \left| n_{e}^{(1)}-n_{e}^{(1)}\right| =0, \end{aligned}$$

where the second equality holds because of Eq. (12).

In summary, the global sensitivity for the approximation of grid quality on multiclass classification \(\varDelta _{\textsf {gq}} \le 1.1\). \(\square \)
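The tie handling in Subcase 2.2 and Case 3 can be illustrated with a few lines of code (our own sketch; top_two and the tie breaking by label order follow the convention stated in the proof): removing one tuple of a tied top class either swaps the top-two counts (Eq. 11) or leaves them unchanged (Eq. 12).

```python
from collections import Counter

def top_two(counts):
    """Top-two (label, count) pairs of a cell; ties broken by label order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

# Subcase 2.2: the two highest classes are tied; remove a tuple of class 1.
cell = Counter({1: 10, 2: 10, 3: 4})
print(top_two(cell))        # [(1, 10), (2, 10)]
cell[1] -= 1
print(top_two(cell))        # [(2, 10), (1, 9)]  -> n'(1) = n(2), n'(2) = n(1) - 1

# Case 3: class 1 is tied with class 3, which sits outside the top two.
cell = Counter({1: 10, 2: 10, 3: 10})
cell[1] -= 1
print(top_two(cell))        # [(2, 10), (3, 10)] -> top-two counts unchanged
```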


Cite this article

Su, D., Cao, J., Li, N. et al. PrivPfC: differentially private data publication for classification. The VLDB Journal 27, 201–223 (2018). https://doi.org/10.1007/s00778-017-0492-3
