PrivBUD-Wise: Differentially Private Frequent Itemsets Mining in High-Dimensional Databases

  • Conference paper
Web and Big Data (APWeb-WAIM 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11641)

Abstract

In this paper, we study the problem of mining frequent itemsets in high-dimensional databases under differential privacy and propose a novel algorithm, PrivBUD-Wise, which achieves high result utility as well as a high privacy level. Existing approaches limit the cardinality of transactions by truncating or splitting them, which causes extra information loss and results in unsatisfactory utility. PrivBUD-Wise instead performs no preprocessing on the original database and guarantees high result utility by reducing, as much as possible, the privacy budget consumed on irrelevant itemsets. To achieve this, we first propose SRNM, a Report Noisy mechanism with an optional number of reported itemsets, and, more importantly, give a rigorous proof for SRNM in the appendix. Moreover, PrivBUD-Wise is the first to propose a biased privacy budget allocation strategy, and it requires no assumption about or estimation of the maximal cardinality. Experiments on three real-world datasets demonstrate the utility and efficiency of PrivBUD-Wise.

Acknowledgement

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61772491 and No. U170921, the Natural Science Foundation of Jiangsu Province under Grant No. BK20161256, and the Anhui Initiative in Quantum Information Technologies under Grant No. AHY150300.

Author information

Corresponding author

Correspondence to Kai Han.

Appendices

Appendix A

Lemma 4

For any \(\delta >0\), if x is a draw from Lap(b), then:

$$\begin{aligned} \mathrm {P}[x\ge \delta +1]=e^{-\frac{1}{b}}\mathrm {P}[x\ge \delta ] \end{aligned}$$

where \(\mathrm {P}\) denotes the probability.

Proof

$$\begin{aligned} \frac{\mathrm {P}[x\ge \delta +1]}{\mathrm {P}[x\ge \delta ]}=\frac{\frac{1}{2b}\int _{\delta +1}^{\infty }e^{-\frac{x}{b}} dx}{\frac{1}{2b}\int _{\delta }^{\infty }e^{-\frac{x}{b}} dx}=\frac{e^{-\frac{\delta +1}{b}}}{e^{-\frac{\delta }{b}}}=e^{-\frac{1}{b}} \end{aligned}$$

Hence, this lemma follows.
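Lemma 4 can be checked numerically with the closed-form tail of the Laplace distribution. The sketch below is only an illustration: the helper names `laplace_tail` and `tail_ratio` are hypothetical and do not come from the paper. It also evaluates a negative threshold, where the ratio is no longer equal to \(e^{-1/b}\) but stays above it.

```python
import math


def laplace_tail(t: float, b: float) -> float:
    """Return P[x >= t] for x drawn from a zero-mean Laplace with scale b."""
    if t >= 0:
        return 0.5 * math.exp(-t / b)
    # For negative t, use 1 - P[x < t], where P[x < t] = 0.5 * exp(t / b).
    return 1.0 - 0.5 * math.exp(t / b)


def tail_ratio(delta: float, b: float) -> float:
    """Return P[x >= delta + 1] / P[x >= delta]."""
    return laplace_tail(delta + 1.0, b) / laplace_tail(delta, b)


if __name__ == "__main__":
    b = 2.0
    target = math.exp(-1.0 / b)
    for delta in (0.1, 1.0, 5.0):
        # Positive thresholds: the ratio matches exp(-1/b), as in Lemma 4.
        print(delta, tail_ratio(delta, b), target)
    # Negative threshold (outside the lemma's assumption): ratio >= exp(-1/b).
    print(-3.0, tail_ratio(-3.0, b), target)
```

The lower bound for negative thresholds points in the same direction as the inequalities used in Appendix B.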

Appendix B

Proof of Theorem 2: Fix \(D=D'\cup \{t\}\), where t is a transaction. Let v and \(v'\) denote the vectors of query counts of SRNM when the dataset is D and \(D'\), respectively. We use m to denote the number of queries (equal to the number of candidate itemsets). Then we have:

  (1) \(v_{i}\ge v^{'}_{i}\) for \(\forall i\in [m]\);

  (2) \(1+v^{'}_{i}\ge v_{i}\) for \(\forall i\in [m]\);
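Properties (1) and (2) state that adding one transaction raises each query count by at most 1 and never lowers it. A minimal sketch on toy data, assuming each query is the support count of a candidate itemset (the function name `support_counts` and the example transactions are made up for illustration):

```python
def support_counts(transactions, candidates):
    """For each candidate itemset, count the transactions that contain it."""
    return [sum(1 for t in transactions if c <= t) for c in candidates]


# Toy neighboring datasets: D = D' plus one extra transaction t.
d_prime = [frozenset("ab"), frozenset("bc"), frozenset("abc")]
t = frozenset("acd")
d = d_prime + [t]

candidates = [frozenset("a"), frozenset("ab"), frozenset("ac"), frozenset("d")]
v = support_counts(d, candidates)
v_prime = support_counts(d_prime, candidates)

for v_i, v_prime_i in zip(v, v_prime):
    # Property (1): v_i >= v'_i.  Property (2): 1 + v'_i >= v_i.
    assert v_prime_i <= v_i <= v_prime_i + 1
```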

Given an integer z, for every \(z'\in [z]\), fix any set \(j=(j_{1},j_{2}, \ldots ,j_{z'})\in [m]^{z'}\). To prove differential privacy, we want to bound the ratio (from above and below) of the probabilities that \((j_{1},j_{2}, \ldots ,j_{z'})\) is selected with D and with \(D'\).

Fix \(r_{-j}\), a draw from \([Lap(z/\epsilon )]^{m-z'}\) that is used for all noisy query counts except the \(z'\) counts corresponding to \(j=(j_{1},j_{2}, \ldots ,j_{z'})\). We use \(\mathrm {P}[j|\theta ]\) to denote the probability that the output of SRNM is j under condition \(\theta \).

Firstly, we prove that \(\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]\): For every \(k\in j\), define

$$\begin{aligned} r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}+r_{k} > v_{i}+r_{i}, \forall i \in [m]\backslash j \end{aligned}$$

Then j is the output with D iff for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}\).

For all \(i\in [m]\backslash j, k\in j\):

$$ \begin{array}{c} v_{k}+r_{k}^{*}>v_{i}+r_{i}\\ \Rightarrow (1+v_{k}^{'})+r_{k}^{*}\ge v_{k}+r_{k}^{*}>v_{i}+r_{i}\ge v_{i}^{'}+r_{i}\\ \Rightarrow v_{k}^{'}+(r_{k}^{*}+1)>v_{i}^{'}+r_{i} \end{array} $$

So, if for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}+1\), then the output with \(D'\) will be j and the added noise will be \((r_{j},r_{-j})\). So we have:

$$\begin{aligned}&\mathrm {P}[j|D',r_{-j}]\ge \mathrm {P}[r_{k}\ge r_{k}^{*}+1|k\in j]= \prod \limits _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}+1]\\&\quad = \prod \limits _{k\in j}e^{-\frac{\epsilon }{z}}\mathrm {P}[r_{k}\ge r_{k}^{*}]=e^{-\frac{z'\epsilon }{z}}\mathrm {P}[j|D,r_{-j}]\ge e^{-\epsilon }\mathrm {P}[j|D,r_{-j}] \end{aligned} \tag{3}$$

The second equality follows from Lemma 4 with \(b=z/\epsilon \). Multiplying both sides by \(e^{\epsilon }\) gives \(\mathrm {P}[j|D,r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D',r_{-j}]\).

We now prove that \(\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]\). For every \(k\in j\), define \(r_{k}^{*}=\mathrm {min}_{r_{k}}:v_{k}^{'}+r_{k} > v_{i}^{'}+r_{i}, \forall i \in [m]\backslash j\). Then j is the output when the dataset is \(D'\) iff for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}\).

For all \(i\in [m]\backslash j, k\in j\):

$$ \begin{array}{c} v_{k}^{'}+r_{k}^{*}>v_{i}^{'}+r_{i}\\ \Rightarrow 1+v_{k}+r_{k}^{*}\ge 1+v_{k}^{'}+r_{k}^{*}>1+v_{i}^{'}+r_{i}\ge v_{i}+r_{i}\\ \Rightarrow v_{k}+(r_{k}^{*}+1)>v_{i}+r_{i} \end{array} $$

So, if for \(\forall k\in j\): \(r_{k}\ge r_{k}^{*}+1\), then the output with D will be j and the added noise will be \((r_{j},r_{-j})\). So we have:

$$\begin{aligned}&\mathrm {P}[j|D,r_{-j}]\ge \mathrm {P}[r_{k}\ge r_{k}^{*}+1|k\in j]= \prod \limits _{k\in j}\mathrm {P}[r_{k}\ge r_{k}^{*}+1]\\&\quad = \prod \limits _{k\in j}e^{-\frac{\epsilon }{z}}\mathrm {P}[r_{k}\ge r_{k}^{*}]=e^{-\frac{z'\epsilon }{z}}\mathrm {P}[j|D',r_{-j}]\ge e^{-\epsilon }\mathrm {P}[j|D',r_{-j}] \end{aligned} \tag{4}$$

Multiplying both sides by \(e^{\epsilon }\) gives \(\mathrm {P}[j|D',r_{-j}]\le e^{\epsilon }\mathrm {P}[j|D,r_{-j}]\). Hence the theorem follows.
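The mechanism analysed in this proof can be sketched in a few lines. The version below is a rough illustration rather than the authors' implementation: it assumes SRNM adds independent \(Lap(z/\epsilon )\) noise to every query count and reports the indices of the \(z'\) largest noisy counts. How \(z'\) is chosen is not described in this appendix, so it is passed in by the caller, and the names `laplace_noise` and `report_noisy_top` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Lap(scale): the difference of two i.i.d. exponentials
    with mean `scale` follows a zero-mean Laplace with that scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def report_noisy_top(counts, z, z_prime, epsilon, rng=None):
    """Return the indices of the z_prime largest noisy counts.

    Assumption: each count receives independent Lap(z / epsilon) noise,
    matching the noise scale used in the proof; z_prime must lie in [1, z].
    """
    if not 1 <= z_prime <= z:
        raise ValueError("z_prime must satisfy 1 <= z_prime <= z")
    if z_prime > len(counts):
        raise ValueError("z_prime cannot exceed the number of queries")
    rng = rng or random.Random()
    scale = z / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    ranked = sorted(range(len(counts)), key=lambda i: noisy[i], reverse=True)
    return ranked[:z_prime]


if __name__ == "__main__":
    # Toy support counts for five candidate itemsets (made-up numbers).
    toy_counts = [120, 95, 30, 110, 12]
    print(report_noisy_top(toy_counts, z=3, z_prime=2, epsilon=1.0))
```

Only the selected indices are returned, not the noisy counts, which matches the event \(\mathrm {P}[j|\theta ]\) bounded in the proof.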

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, J., Han, K., Song, P., Xu, C., Gui, F. (2019). PrivBUD-Wise: Differentially Private Frequent Itemsets Mining in High-Dimensional Databases. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_8

  • DOI: https://doi.org/10.1007/978-3-030-26072-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26071-2

  • Online ISBN: 978-3-030-26072-9

  • eBook Packages: Computer Science, Computer Science (R0)
