
A Framework for Evaluating Privacy Preserving Data Mining Algorithms*

Abstract

Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amounts of data while, at the same time, protecting sensitive information. Several data mining techniques incorporating privacy protection mechanisms have been developed that allow one to hide sensitive itemsets or patterns before the data mining process is executed. Privacy preserving classification methods, in contrast, prevent a miner from building a classifier able to predict sensitive data. Additionally, privacy preserving clustering techniques have recently been proposed that distort sensitive numerical attributes while preserving the general features on which clustering analysis relies. A crucial issue is to determine which of these privacy preserving techniques better protect sensitive information. However, this is not the only criterion with respect to which such algorithms can be evaluated: it is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms themselves. There is thus a need to identify a comprehensive set of criteria with respect to which existing PPDM algorithms can be assessed, so as to determine which algorithm meets specific requirements.

In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. We then apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, we discuss future work and promising directions in the context of privacy preservation in data mining.
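To make the flavor of such techniques concrete, the following minimal Python sketch illustrates the kind of additive-noise perturbation that distortion-based methods apply to sensitive numerical attributes. It is a generic illustration under our own assumptions, not one of the specific algorithms evaluated in the paper, and all names in it are ours.

    import random

    def perturb(values, scale=5.0):
        """Release a noisy copy of a sensitive numeric attribute.

        Each value is shifted by zero-mean uniform noise, so individual
        records no longer reveal their exact originals, while aggregate
        features (e.g. the mean) are approximately preserved.
        """
        return [v + random.uniform(-scale, scale) for v in values]

    ages = [23, 37, 41, 29, 55]
    released = perturb(ages)
    print(released)                        # distorted individual values
    print(sum(released) / len(released))   # close to the original mean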


Author information

Correspondence to Elisa Bertino.

Additional information

*The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the Sponsors of CERIAS.

Editor: Geoff Webb

Appendix A: Proofs

Proof of Theorem 4.1:

First, Algorithm GIH sorts the itemsets in \(L_h\) with respect to the number of items they contain and their minimum support. In general, sorting N numbers has a complexity in \(O(N\log N)\). However, in our case the length of transactions, and thus of the considered itemsets, has an upper bound that is very small compared to the size of the database. Algorithm GIH therefore takes \(O(|L_h|)\) to sort the set \(L_h\) of large itemsets to hide in descending order of their size and minimum support. For each itemset Z in \(L_h\), it takes O(|D|) to generate the set \(T_Z\) of the transactions in D that support Z, assuming that the transaction length is bounded by a constant. In addition, the algorithm sorts \(T_Z\) in ascending order of transaction size. Since the transaction length is bounded, this sorting step is of order \(O(|T_Z|)\). Then, the item \(i \in Z\) with the highest minimum support is selected and a question mark is placed for that item in the transaction of minimum size; this is repeated until the minimum support of the itemset Z falls below the MST by SM. After k iterations, the minimum support of Z will be \({\it minsup}(Z)^{(k)}=\frac{|T_Z|- k}{|D|}\); thus the number of steps required to hide Z is \(|T_Z| - (MST-SM) \ast |D|\). Since \(|T_Z| \le |D|\) and \((MST-SM) \ast |D| \le |D|\), we can state that Algorithm GIH requires \(O(|L_h| \ast |D|)\) time to hide all the itemsets belonging to \(L_h\) and, consequently, all the sensitive rules whose generating itemsets are stored in \(L_h\).
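The paper's pseudocode for Algorithm GIH is not included in this excerpt, so the Python sketch below reconstructs only the loop structure that the proof analyzes: sort the itemsets to hide, sort the supporting transactions by size, and blank out one item per step until the support drops below MST - SM. The data layout and all helper names are our assumptions, and the support counter is recomputed from scratch for clarity, whereas the proof assumes it is maintained incrementally.

    def support(D, Z):
        """Fraction of transactions in D that fully support itemset Z."""
        return sum(1 for t in D if Z <= t["items"]) / len(D)

    def gih(D, L_h, MST, SM):
        """Sketch of Algorithm GIH as analyzed in the proof of Theorem 4.1.

        D   : list of transactions, each {"items": set, "hidden": set}
        L_h : itemsets (frozensets) to hide
        MST : minimum support threshold; SM : safety margin
        """
        # Sort the itemsets to hide in descending order of size and support.
        for Z in sorted(L_h, key=lambda zs: (len(zs), support(D, zs)),
                        reverse=True):
            # T_Z: supporting transactions in ascending order of size.
            T_Z = sorted((t for t in D if Z <= t["items"]),
                         key=lambda t: len(t["items"]))
            for t in T_Z:
                if support(D, Z) < MST - SM:
                    break
                # Select the item of Z with the highest support and place a
                # question mark for it (modeled by moving it to "hidden").
                i = max(Z, key=lambda it: support(D, frozenset([it])))
                t["items"].discard(i)
                t["hidden"].add(i)

    D = [{"items": {"a", "b", "c"}, "hidden": set()},
         {"items": {"a", "b"}, "hidden": set()},
         {"items": {"a", "c"}, "hidden": set()}]
    gih(D, [frozenset({"a", "b"})], MST=0.5, SM=0.1)
    print([t["items"] for t in D])   # {"a", "b"} now falls below MST - SM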

Proof of Theorem 4.2:

For each sensitive rule r to hide, Algorithm CR performs the following tasks: it generates the set \(T_r\) of transactions of D supporting r, taking \(O(|D| \ast ATL)\); then, for each transaction t in \(T_r\), it counts the number of items in t, which is of order \(O(|T_r| \ast ATL)\). To sort the transactions in \(T_r\) in ascending order of the number of items they support, Algorithm CR takes \(O(|T_r|)\), for the reason explained above. To hide the selected rule r, the algorithm executes the inner loop until the minimum support falls below the MST by SM or the minimum confidence falls below the MCT by SM. Initially, \({\it minsup}(r)\) and \({\it minconf}(r)\) are equal to \(\frac{|T_r|}{|D|}\) and \(\frac{|T_r|}{|T_{l_r}|}\) respectively; after k iterations they become \(\frac{|T_r|-k}{|D|}\) and \(\frac{|T_r|-k}{|T_{l_r}|}\). This implies that the inner loop executes until \(\frac{|T_r|-k}{|D|}<MST-SM\) or \(\frac{|T_r|-k}{|T_{l_r}|}<MCT-SM\), that is, for \(k=\min\left(|D|\ast({\it minsup}(r)-(MST-SM)),\ |D|\ast({\it minsup}(r)-(MCT-SM)\ast {\it minsup}(l_r))\right)\) steps, which in both cases is O(|D|). The cost of each iteration is given by the choose_item operation, which takes O(1) time. Therefore, Algorithm CR takes \(O(|R_h| \ast A_D)\) to hide all the sensitive rules in \(R_h\).
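As with GIH, the pseudocode of Algorithm CR is not part of this excerpt. The sketch below, which reuses the transaction layout and the support() helper of the GIH sketch, mirrors the steps the proof counts: build and sort \(T_r\), then blank out one item per iteration, via a choose_item policy that the paper treats as an O(1) primitive, until support or confidence falls below its threshold minus the safety margin. All names are again our assumptions.

    def cr(D, r, MST, MCT, SM, choose_item):
        """Sketch of Algorithm CR as analyzed in the proof of Theorem 4.2.

        r           : a rule (l_r, r_r), both frozensets of items
        choose_item : O(1) policy picking which item of the rule to blank
                      out of a given transaction (left abstract here)
        """
        l_r, r_r = r
        rule = l_r | r_r
        # T_r: transactions fully supporting r, in ascending order of the
        # number of items they contain.
        T_r = sorted((t for t in D if rule <= t["items"]),
                     key=lambda t: len(t["items"]))
        for t in T_r:
            minsup = support(D, rule)
            if minsup < MST - SM:          # assumes MST > SM
                break
            # support(D, l_r) >= minsup > 0 here, so the division is safe.
            if minsup / support(D, l_r) < MCT - SM:
                break
            i = choose_item(t, r)
            t["items"].discard(i)
            t["hidden"].add(i)             # place a question mark for item i

For example, a naive choose_item policy is lambda t, r: next(iter(r[1])), which always blanks an item of the rule's consequent.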

Proof of Theorem 4.3:

For each rule r, Algorithm CR2 performs the following tasks: it generates the set \(T'_{l_r}\) of transactions of D that partially support \(l_r\) and do not fully support \(r_r\), taking \(O(|D| \ast ATL)\); then, for each transaction t in \(T'_{l_r}\), it counts the number of items of \(l_r\) in t, which is of order \(O(|T'_{l_r}|\ast ATL)\). To sort the transactions in \(T'_{l_r}\) in descending order of the calculated counts, the algorithm takes \(O(|T'_{l_r}|)\) in this particular case. To hide the selected rule r, the algorithm executes the inner loop until the minimum confidence falls below the MCT by SM. Initially, \({\it minconf}(r)\) is equal to \(\frac{{\it minsup}(r)}{{\it maxsup}(l_r)}=\frac{|T_r|}{|T'_{l_r}|}\); after k iterations the fraction becomes \(\frac{|T_r|}{|T'_{l_r}|+k}\). This implies that the inner loop executes \(\lceil |D|(\frac{{\it minsup}(r)}{MCT-SM}-{\it maxsup}(l_r))\rceil\) steps. Each iteration takes O(1) time; thus Algorithm CR2 performs in \(O(|R_h|\ast A_D)\).
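Finally, a sketch of Algorithm CR2 as the proof describes it, again under the layout and helper assumptions of the previous sketches: confidence is lowered not by removing items but by raising the support of the antecedent \(l_r\), completing it in transactions that partially support \(l_r\) and do not fully support \(r_r\). For simplicity this sketch recomputes the confidence directly each round, rather than tracking the maxsup-based estimate used in the proof.

    def cr2(D, r, MCT, SM):
        """Sketch of Algorithm CR2 as analyzed in the proof of Theorem 4.3."""
        l_r, r_r = r
        rule = l_r | r_r
        # T'_{l_r}: transactions partially (not fully) supporting l_r and
        # not fully supporting r_r, in descending order of |t ∩ l_r|.
        T_prime = sorted((t for t in D
                          if t["items"] & l_r and not l_r <= t["items"]
                          and not r_r <= t["items"]),
                         key=lambda t: len(t["items"] & l_r), reverse=True)
        for t in T_prime:
            sup_l = support(D, l_r)
            if sup_l == 0:
                break                      # the rule holds in no transaction
            if support(D, rule) / sup_l < MCT - SM:
                break
            # Completing l_r in t raises support(l_r) but, since t does not
            # support r_r, leaves support(r) unchanged: confidence drops.
            t["items"] |= l_r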

About this article

Cite this article

Bertino, E., Fovino, I.N. & Provenza, L.P. A Framework for Evaluating Privacy Preserving Data Mining Algorithms*. Data Min Knowl Disc 11, 121–154 (2005). https://doi.org/10.1007/s10618-005-0006-6
