Data Mining and Knowledge Discovery, Volume 28, Issue 5–6, pp 1158–1188

Generalization-based privacy preservation and discrimination prevention in data publishing and mining

  • Sara Hajian
  • Josep Domingo-Ferrer
  • Oriol Farràs


The information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet data publishing and mining do not come without dangers, namely privacy invasion and potential discrimination against the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data that are biased against certain protected groups (defined by ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same as or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.
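As a rough illustration of the two properties the abstract wants a published data set to satisfy together, the following minimal Python sketch generalizes a quasi-identifier and then checks a privacy condition and a bias measure. Everything in it is an assumption made for illustration: the abstract does not name a specific privacy model or discrimination measure, so the sketch assumes k-anonymity over (age, sex) and a simple difference in grant rates between groups; the toy records and the function names `generalize_age`, `is_k_anonymous` and `grant_rate_gap` are hypothetical and are not the paper's algorithm.

```python
from collections import Counter

# Hypothetical toy records: (age, sex, decision). Not data from the paper.
RECORDS = [
    (23, "F", "deny"),
    (27, "F", "deny"),
    (25, "M", "grant"),
    (29, "M", "grant"),
    (34, "F", "grant"),
    (38, "F", "deny"),
    (31, "M", "grant"),
    (36, "M", "grant"),
]


def generalize_age(age, width):
    """Replace an exact age by the width-sized interval that contains it."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"


def is_k_anonymous(rows, k):
    """True if every (age, sex) combination occurs in at least k rows."""
    counts = Counter((age, sex) for age, sex, _ in rows)
    return all(c >= k for c in counts.values())


def grant_rate_gap(rows, protected="F"):
    """Grant rate of the unprotected group minus that of the protected group.

    Assumed, simplified bias measure; the paper may use a different one.
    """
    def rate(group):
        return sum(1 for r in group if r[2] == "grant") / len(group)

    prot = [r for r in rows if r[1] == protected]
    rest = [r for r in rows if r[1] != protected]
    return rate(rest) - rate(prot)


# Generalize age to 10-year intervals, keeping sex and decision unchanged.
generalized = [(generalize_age(a, 10), s, d) for a, s, d in RECORDS]

print(is_k_anonymous(RECORDS, 2))      # exact ages: privacy check on raw data
print(is_k_anonymous(generalized, 2))  # privacy check after generalization
print(grant_rate_gap(generalized))     # bias measure after generalization
```

In this toy data, generalizing age is enough to pass the assumed privacy check, yet the grant-rate gap between groups is left untouched, which hints at why a transformation aimed only at privacy does not by itself address discrimination.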


Data mining, Anti-discrimination, Privacy, Generalization



The authors wish to thank Kristen LeFevre for providing the implementation of the Incognito algorithm and Guillem Rufian-Torrell for helping in the implementation of the algorithm proposed in this paper. This work was partly supported by the Government of Catalonia under Grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY”, TIN2012-32757 “ICWT” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES”, and by the European Commission under FP7 projects “DwB” and “INTER-TRUST”. The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.



Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Sara Hajian (1)
  • Josep Domingo-Ferrer (1)
  • Oriol Farràs (1)

  1. Department of Computer Engineering and Maths, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Catalonia
