PrivBUD-Wise: Differentially Private Frequent Itemsets Mining in High-Dimensional Databases

Xu, Jingxin; Han, Kai; Song, Pingping; Xu, Chaoting; Gui, Fei

doi:10.1007/978-3-030-26072-9_8


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11641)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data


Abstract

In this paper, we study the problem of mining frequent itemsets in high-dimensional databases with differential privacy and propose a novel algorithm, PrivBUD-Wise, which achieves high result utility as well as a high privacy level. Truncating or splitting transactions to limit their cardinality causes extra information loss and results in unsatisfactory utility; PrivBUD-Wise instead performs no preprocessing on the original database and guarantees high result utility by reducing, as much as possible, the extra privacy budget consumed on irrelevant itemsets. To achieve this, we first propose SRNM, a Report Noisy mechanism with an optional number of reported itemsets, and, more importantly, we give a strict proof for SRNM in the appendix. Moreover, PrivBUD-Wise is the first to propose a biased privacy budget allocation strategy, and it requires no assumption or estimation of the maximal cardinality. Experiments on three real-world datasets show the good utility and efficiency of PrivBUD-Wise.



Acknowledgement

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61772491, No. U170921, the Natural Science Foundation of Jiangsu Province under Grant No. BK20161256, and the Anhui Initiative in Quantum Information Technologies AHY150300.

Author information

Authors and Affiliations

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Jingxin Xu, Kai Han, Chaoting Xu & Fei Gui
School of Arts, Anhui University, Hefei, China
Pingping Song


Corresponding author

Correspondence to Kai Han.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Jie Shao
Hong Kong Polytechnic University, Hong Kong, China
Man Lung Yiu
The University of Tokyo, Tokyo, Japan
Masashi Toyoda
Zhejiang University, Hangzhou, China
Dongxiang Zhang
National University of Singapore, Singapore, Singapore
Wei Wang
Peking University, Beijing, China
Bin Cui

Appendices

Appendix A

Lemma 4

For any $\delta >0$, if $x$ is a draw from $\mathrm {Lap}(b)$, then:

$$\begin{aligned} \mathrm {P}[x\ge \delta +1]=e^{-\frac{1}{b}}\mathrm {P}[x\ge \delta ] \end{aligned}$$

where $\mathrm {P}$ denotes the probability.

Proof

$$\begin{aligned} \frac{\mathrm {P}[x\ge \delta +1]}{\mathrm {P}[x\ge \delta ]}=\frac{\frac{1}{2b}\int _{\delta +1}^{\infty }e^{-\frac{x}{b}} dx}{\frac{1}{2b}\int _{\delta }^{\infty }e^{-\frac{x}{b}} dx}=\frac{e^{-\frac{\delta +1}{b}}}{e^{-\frac{\delta }{b}}}=e^{-\frac{1}{b}} \end{aligned}$$

Hence, this lemma follows.
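As a quick numerical illustration of Lemma 4, the following minimal Python sketch evaluates the closed-form tail probability of a zero-mean Laplace distribution with scale $b$ and compares the ratio of the two tails with $e^{-1/b}$. The function names `laplace_tail` and `tail_ratio` are hypothetical helpers introduced here, not part of the paper.

```python
import math


def laplace_tail(t: float, b: float) -> float:
    """Return P[x >= t] for x drawn from a zero-mean Laplace with scale b."""
    if t >= 0:
        return 0.5 * math.exp(-t / b)
    return 1.0 - 0.5 * math.exp(t / b)


def tail_ratio(delta: float, b: float) -> float:
    """Return P[x >= delta + 1] / P[x >= delta]."""
    return laplace_tail(delta + 1.0, b) / laplace_tail(delta, b)


if __name__ == "__main__":
    b = 2.0  # arbitrary scale chosen for illustration
    for delta in (0.5, 2.0, 10.0):
        # For delta > 0 the two printed values should agree (Lemma 4).
        print(delta, tail_ratio(delta, b), math.exp(-1.0 / b))
```

For every positive `delta` the printed ratio matches `math.exp(-1.0 / b)` up to floating-point error, independently of `delta`.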

Appendix B

Proof of Theorem 2: Fix $D=D'\cup \{t\}$, where $t$ is a transaction. Let $v$ and $v'$ denote the vectors of query counts of SRNM when the dataset is $D$ and $D'$, respectively. We use $m$ to denote the number of queries (equal to the number of candidate itemsets). Then we have:

(1) $v_{i}\ge v^{'}_{i}$ for all $i\in [m]$;
(2) $1+v^{'}_{i}\ge v_{i}$ for all $i\in [m]$.
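Properties (1) and (2) say that adding one transaction raises each candidate's count by either 0 or 1. A minimal Python check follows, assuming the query count of a candidate itemset is the number of transactions containing it; the toy database, the candidates, and the helper `support_counts` are invented for illustration only.

```python
def support_counts(database, candidates):
    """Count, for each candidate itemset, the transactions that contain it."""
    return [sum(1 for t in database if c <= t) for c in candidates]


# Toy data, invented for illustration only.
d_prime = [frozenset("abc"), frozenset("ab"), frozenset("cd")]
t = frozenset("abd")   # the extra transaction
d = d_prime + [t]      # D = D' plus {t}
candidates = [frozenset("a"), frozenset("ab"), frozenset("cd"), frozenset("d")]

v = support_counts(d, candidates)
v_prime = support_counts(d_prime, candidates)

# Property (1): v_i >= v'_i.  Property (2): 1 + v'_i >= v_i.
assert all(vi >= vpi for vi, vpi in zip(v, v_prime))
assert all(1 + vpi >= vi for vi, vpi in zip(v, v_prime))
```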

Given an integer $z$, for every $z'\in [z]$, fix any set $j=(j_{1},j_{2}, \ldots ,j_{z'})\in [m]^{z'}$. To prove differential privacy, we want to bound the ratio (from above and below) of the probabilities that $(j_{1},j_{2}, \ldots ,j_{z'})$ is selected with $D$ and with $D'$.

Fix $r_{-j}$, a draw from $[\mathrm {Lap}(z/\epsilon )]^{m-z'}$ that is used for all noisy query counts except the $z'$ counts corresponding to $j=(j_{1},j_{2}, \ldots ,j_{z'})$. We use $\mathrm {P}[j|\theta ]$ to denote the probability that the output of SRNM is $j$ under condition $\theta $.

Firstly, we prove that $\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]$: For every $k\in j$, define

$$\begin{aligned} r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}+r_{k} > v_{i}+r_{i}, \forall i \in [m]\backslash j \end{aligned}$$

Then j is the output with D iff for $\forall k\in j$: $r_{k}\ge r_{k}^{*}$.

For all $i\in [m]\backslash j, k\in j$:

$$ \begin{array}{c} v_{k}+r_{k}^{*}>v_{i}+r_{i}\\ \Rightarrow (1+v_{k}^{'})+r_{k}^{*}\ge v_{k}+r_{k}^{*}>v_{i}+r_{i}\ge v_{i}^{'}+r_{i}\\ \Rightarrow v_{k}^{'}+(r_{k}^{*}+1)>v_{i}^{'}+r_{i} \end{array} $$

So, if for $\forall k\in j$: $r_{k}\ge r_{k}^{*}+1$, then the output with $D'$ will be j and the added noise will be $(r_{j},r_{-j})$. So we have:

$$\begin{aligned} \mathrm {P}[j|D',r_{-j}]&\ge \mathrm {P}[r_{k}\ge r_{k}^{*}+1|k\in j]= \prod \limits _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}+1]\\ &= \prod \limits _{k\in j}e^{-\frac{\epsilon }{z}}\mathrm {P}[r_{k}\ge r_{k}^{*}]=e^{-\frac{z'\epsilon }{z}}\mathrm {P}[j|D,r_{-j}]\ge e^{-\epsilon }\mathrm {P}[j|D,r_{-j}] \end{aligned} \tag{3}$$

The second equality is due to Lemma 4 with $b=z/\epsilon $, and the last inequality holds because $z'\le z$. Multiplying both sides by $e^{\epsilon }$ gives $\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]$.

We now prove that $\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]$. For every $k\in j$, define $r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}^{'}+r_{k} > v_{i}^{'}+r_{i}, \forall i \in [m]\backslash j$. Then $j$ is the output when the dataset is $D'$ iff $r_{k}\ge r_{k}^{*}$ for all $k\in j$.

For all $i\in [m]\backslash j, k\in j$:

$$ \begin{array}{c} v_{k}^{'}+r_{k}^{*}>v_{i}^{'}+r_{i}\\ \Rightarrow 1+v_{k}+r_{k}^{*}\ge 1+v_{k}^{'}+r_{k}^{*}>1+v_{i}^{'}+r_{i}\ge v_{i}+r_{i}\\ \Rightarrow v_{k}+(r_{k}^{*}+1)>v_{i}+r_{i} \end{array} $$

So, if for $\forall k\in j$: $r_{k}\ge r_{k}^{*}+1$, then the output with D will be j and the added noise will be $(r_{j},r_{-j})$. So we have:

$$\begin{aligned} \mathrm {P}[j|D,r_{-j}]&\ge \mathrm {P}[r_{k}\ge r_{k}^{*}+1|k\in j]= \prod \limits _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}+1]\\ &= \prod \limits _{k\in j}e^{-\frac{\epsilon }{z}}\mathrm {P}[r_{k}\ge r_{k}^{*}]=e^{-\frac{z'\epsilon }{z}}\mathrm {P}[j|D',r_{-j}]\ge e^{-\epsilon }\mathrm {P}[j|D',r_{-j}] \end{aligned} \tag{4}$$

Multiplying both sides by $e^{\epsilon }$ gives $\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]$. Hence the theorem follows.
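The selection step analysed in this proof can be sketched in Python as follows. This is only an illustration under stated assumptions: it supposes that SRNM adds independent $\mathrm{Lap}(z/\epsilon)$ noise to each of the $m$ candidate counts and reports the $z'\le z$ candidates with the largest noisy counts. The full definition of SRNM, including how $z'$ is chosen, is given in the main text and is not reproduced here, so `srnm_sketch`, its parameters, and the helper `laplace` are hypothetical names.

```python
import random


def laplace(b, rng):
    """Draw from a zero-mean Laplace with scale b.

    The difference of two i.i.d. exponentials with mean b is Lap(b).
    """
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def srnm_sketch(counts, z, z_prime, epsilon, rng=None):
    """Return indices of the z_prime candidates with the largest noisy counts.

    Assumption: every count receives independent Lap(z / epsilon) noise,
    matching the noise scale used in the proof. How z_prime is chosen is
    not modelled here; it is simply passed in.
    """
    if not 1 <= z_prime <= z:
        raise ValueError("z_prime must satisfy 1 <= z_prime <= z")
    rng = rng or random.Random()
    scale = z / epsilon
    noisy = [c + laplace(scale, rng) for c in counts]
    ranked = sorted(range(len(counts)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:z_prime])


if __name__ == "__main__":
    toy_counts = [50, 3, 40, 7, 1]  # invented support counts
    print(srnm_sketch(toy_counts, z=2, z_prime=2, epsilon=1.0))
```

With a small `epsilon` the reported indices vary from run to run; as `epsilon` grows, the noise scale `z / epsilon` shrinks and the output concentrates on the candidates with the largest true counts.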


About this paper

Cite this paper

Xu, J., Han, K., Song, P., Xu, C., Gui, F. (2019). PrivBUD-Wise: Differentially Private Frequent Itemsets Mining in High-Dimensional Databases. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_8


DOI: https://doi.org/10.1007/978-3-030-26072-9_8
Published: 18 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26071-2
Online ISBN: 978-3-030-26072-9
eBook Packages: Computer Science; Computer Science (R0)
