## Abstract

Recently, a new class of data mining methods, known as *privacy preserving data mining* (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amounts of data while, at the same time, protecting sensitive information. Several data mining techniques incorporating privacy protection mechanisms have been developed that allow one to hide sensitive itemsets or patterns before the data mining process is executed. Privacy preserving classification methods, instead, prevent a miner from building a classifier able to predict sensitive data. Additionally, privacy preserving clustering techniques have recently been proposed, which distort sensitive numerical attributes while preserving the general features relevant for clustering analysis. A crucial issue is to determine which of these privacy-preserving techniques better protect sensitive information. This, however, is not the only criterion with respect to which these algorithms can be evaluated. It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms themselves. There is thus a need to identify a comprehensive set of criteria with respect to which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements.

In this paper, we present a first evaluation framework for assessing and comparing different kinds of PPDM algorithms. We then apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, we present some considerations about future work and promising directions in the context of privacy preservation in data mining.

## Notes

Health Insurance Portability and Accountability Act



## Additional information

^{*}The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the Sponsors of CERIAS.

### Editor:

Geoff Webb

## Appendix A: Proofs

### Proof of Theorem 4.1

First, Algorithm GIH sorts the itemsets in \(L_h\) with respect to their size and their minimum support. In general, sorting *N* numbers has complexity *O*(*N* log *N*). However, in our case the length of transactions, and thus of the considered itemsets, has an upper bound that is very small compared to the size of the database. Then, Algorithm GIH takes \(O(|L_h|)\) to sort the set \(L_h\) of large itemsets to hide in descending order of their size and minimum support. For each itemset *Z* in \(L_h\), it takes *O*(|*D*|) to generate the set \(T_Z\) of the transactions in *D* that support *Z*, assuming that the transaction length is bounded by a constant. In addition, the algorithm sorts \(T_Z\) in ascending order of transaction size. Since the transaction length is bounded, the sorting algorithm is of order \(O(|T_Z|)\). Then, the item *i* ∈ *Z* with the highest minimum support is selected and a question mark is placed for that item in the transaction of minimum size; this is repeated until the minimum support of the itemset *Z* falls below *MST* by *SM*. After *k* iterations, the minimum support of *Z* is \({\it minsup}(Z)^{(k)}=\frac{|T_Z|- k}{|D|}\); thus the number of steps required to hide *Z* is \(|T_Z| - (MST - SM)\ast|D|\). Since \(|T_Z| \le |D|\) and \((MST - SM)\ast|D| \le |D|\), we can state that Algorithm GIH requires \(O(|L_h|\ast|D|)\) time to hide all the itemsets belonging to \(L_h\), and consequently all the sensitive rules whose generating itemsets are stored in \(L_h\).
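As an illustration, the hiding loop analyzed above can be sketched as follows. This is a hypothetical simplification, not the authors' actual implementation of GIH: transactions are modeled as Python sets, "placing a question mark" for an item is modeled by simply removing the item, and the function names are assumptions.

```python
# Simplified sketch of the itemset-hiding loop analyzed in Theorem 4.1.
# Hypothetical model: a transaction is a set of items; "placing a
# question mark" for an item is modeled by removing the item.

def support(D, Z):
    """Fraction of transactions in D that fully support itemset Z."""
    return sum(1 for t in D if Z <= t) / len(D)

def hide_itemsets(D, Lh, MST, SM):
    """Lower the support of every itemset in Lh below MST - SM."""
    # Sort the itemsets to hide by descending size, then descending support.
    Lh = sorted(Lh, key=lambda Z: (len(Z), support(D, Z)), reverse=True)
    for Z in Lh:
        # T_Z: supporting transactions, smallest first (as in the proof).
        TZ = sorted((t for t in D if Z <= t), key=len)
        # Pick the item of Z with the highest individual support.
        i = max(Z, key=lambda x: support(D, {x}))
        k = 0
        # Each pass removes i from one transaction, so the support of Z
        # decreases by 1/|D| per iteration, as in the proof.
        while support(D, Z) >= MST - SM and k < len(TZ):
            TZ[k].discard(i)
            k += 1
    return D
```

Each itemset needs at most \(|T_Z| - (MST - SM)\ast|D|\) passes of the inner loop, matching the \(O(|L_h|\ast|D|)\) bound of the theorem.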

### Proof of Theorem 4.2

For each sensitive rule *r* to hide, Algorithm CR performs the following tasks: it generates the set \(T_r\) of transactions of *D* supporting *r*, taking \(O(|D|\ast ATL)\); then, for each transaction *t* in \(T_r\), it counts the number of items in *t*, which is of order \(O(|T_r|\ast ATL)\). To sort the transactions in \(T_r\) in ascending order of the number of items supported, Algorithm CR takes \(O(|T_r|)\) for the reason explained above. To hide the selected rule *r*, the algorithm executes the inner loop until the minimum support becomes lower than *MST* by *SM* or the minimum confidence becomes lower than *MCT* by *SM*. Initially, *minsup(r)* and *minconf(r)* are equal to \(\frac{|T_r|}{|D|}\) and \(\frac{|T_r|}{|T_{l_r}|}\) respectively, and after *k* iterations the fractions become \(\frac{|T_r|-k}{|D|}\) and \(\frac{|T_r|-k}{|T_{l_r}|}\). This implies that the inner loop executes until \(\frac{|T_r|-k}{|D|}<MST-SM\) or \(\frac{|T_r|-k}{|T_{l_r}|}<MCT-SM\), that is, for \(k=\min(|D|\ast({\it minsup}(r)-(MST-SM)),\ |D|\ast({\it minsup}(r)-(MCT-SM)\ast {\it minsup}(l_r)))\) iterations, which in both cases is *O*(|*D*|). The cost of each iteration is given by the *choose_item* operation, which takes *O*(1) time. Therefore, Algorithm CR takes \(O(|R_h|\ast A_D)\) to hide all the sensitive rules in \(R_h\).
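The iteration bound derived above can be computed directly. A minimal sketch, assuming a hypothetical helper `cr_iterations` that takes the supports and thresholds as plain floats:

```python
# Number of inner-loop iterations of Algorithm CR (Theorem 4.2).
# Each iteration removes rule r from one supporting transaction, so the
# loop stops as soon as either the support condition or the confidence
# condition is met; the smaller of the two counts wins.

def cr_iterations(minsup_r, minsup_lr, MST, MCT, SM, D_size):
    """k = min(|D|*(minsup(r) - (MST - SM)),
               |D|*(minsup(r) - (MCT - SM)*minsup(l_r)))."""
    k_support = D_size * (minsup_r - (MST - SM))
    k_conf = D_size * (minsup_r - (MCT - SM) * minsup_lr)
    return min(k_support, k_conf)
```

Both terms are bounded by |*D*|, which yields the *O*(|*D*|) cost per rule used in the theorem.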

### Proof of Theorem 4.3

For each rule *r*, Algorithm CR2 performs the following tasks: it generates the set \(T'_{l_r}\) of transactions of *D* that partially support \(l_r\) and do not fully support *r*, taking \(O(|D|\ast ATL)\); then, for each transaction *t* in \(T'_{l_r}\), it counts the number of items of \(l_r\) in *t*, which is of order \(O(|T'_{l_r}|\ast ATL)\). To sort the transactions in \(T'_{l_r}\) in descending order of the calculated counts, the algorithm takes \(O(|T'_{l_r}|)\) in this particular case. To hide the selected rule *r*, the algorithm executes the inner loop until the minimum confidence becomes lower than *MCT* by *SM*. Initially, *minconf(r)* is equal to \(\frac{{\it minsup}(r)}{{\it maxsup}(l_r)}=\frac{|T_r|}{|T'_{l_r}|}\), and after *k* iterations the fraction becomes \(\frac{|T_r|}{|T'_{l_r}|+k}\). This implies that the inner loop executes \(\lceil |D|\ast(\frac{{\it minsup}(r)}{MCT-SM}-{\it minsup}(l_r))\rceil\) steps. Each iteration takes *O*(1) time; thus, Algorithm CR2 performs in \(O(|R_h|\ast A_D)\).
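The step count for CR2 can likewise be checked numerically. A minimal sketch working with transaction counts rather than supports; the name `cr2_steps` and the count-based interface are assumptions, not part of the original algorithm:

```python
import math

# Steps of Algorithm CR2's inner loop (Theorem 4.3). Each step adds one
# partial supporter of l_r to T'_{l_r}, so the confidence after k steps
# is |T_r| / (|T'_{l_r}| + k), and it must fall strictly below MCT - SM.

def cr2_steps(Tr, Tlr_prime, MCT, SM):
    threshold = MCT - SM
    # Closed form from the proof, guarded against floating-point edge
    # cases by re-checking the confidence explicitly.
    k = max(0, math.ceil(Tr / threshold - Tlr_prime))
    while Tr / (Tlr_prime + k) >= threshold:
        k += 1
    return k
```

With |*T_r*| = 30, \(|T'_{l_r}|\) = 50, *MCT* = 0.7 and *SM* = 0.1, the initial confidence 0.6 sits exactly at the threshold, so a single step suffices.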


## About this article

### Cite this article

Bertino, E., Fovino, I.N. & Provenza, L.P. A Framework for Evaluating Privacy Preserving Data Mining Algorithms^{*}. *Data Min Knowl Disc* **11**, 121–154 (2005). https://doi.org/10.1007/s10618-005-0006-6
