
Generalization-based privacy preservation and discrimination prevention in data publishing and mining

Published in: Data Mining and Knowledge Discovery

Abstract

Living in the information society facilitates the automatic collection of huge amounts of data on individuals, organizations, etc. Publishing such data for secondary analysis (e.g. learning models and finding patterns) may be extremely useful to policy makers, planners, marketing analysts, researchers and others. Yet, data publishing and mining do not come without dangers, namely privacy invasion and also potential discrimination of the individuals whose data are published. Discrimination may ensue from training data mining models (e.g. classifiers) on data which are biased against certain protected groups (ethnicity, gender, political preferences, etc.). The objective of this paper is to describe how to obtain data sets for publication that are: (i) privacy-preserving; (ii) unbiased regarding discrimination; and (iii) as useful as possible for learning models and finding patterns. We present the first generalization-based approach to simultaneously offer privacy preservation and discrimination prevention. We formally define the problem, give an optimal algorithm to tackle it and evaluate the algorithm in terms of both general and specific data analysis metrics (i.e. various types of classifiers and rule induction algorithms). It turns out that the impact of our transformation on the quality of data is the same or only slightly higher than the impact of achieving just privacy preservation. In addition, we show how to extend our approach to different privacy models and anti-discrimination legal concepts.


Notes

  1. The use of PD (resp., PND) attributes in decision making does not necessarily lead to (or exclude) discriminatory decisions (Ruggieri et al. 2010).

  2. In full-domain generalization if a value is generalized, all its instances are generalized. There are alternative generalization schemes, such as multi-dimensional generalization or cell generalization, in which some instances of a value may remain ungeneralized while other instances are generalized.
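The distinction drawn in this note can be made concrete with a small sketch. The taxonomy, attribute name and records below are illustrative assumptions, not from the paper; the point is only that under full-domain generalization *every* instance of a chosen value is replaced by its generalization, whereas multi-dimensional or cell schemes may generalize some instances and leave others specific.

```python
# Hypothetical one-level taxonomy: child value -> its generalization.
TAXONOMY = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Clerk": "Non-professional",
}

def full_domain_generalize(records, attr, values_to_generalize):
    """Replace every instance of each chosen value of `attr` by its
    parent in the taxonomy (full-domain scheme): no instance of a
    generalized value remains at the specific level."""
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if rec[attr] in values_to_generalize:
            rec[attr] = TAXONOMY[rec[attr]]
        out.append(rec)
    return out

records = [{"job": "Engineer"}, {"job": "Lawyer"}, {"job": "Engineer"}]
generalized = full_domain_generalize(records, "job", {"Engineer"})
# Both "Engineer" records become "Professional"; "Lawyer" is untouched.
```

A cell-generalization scheme, by contrast, could legally map only the first "Engineer" record to "Professional", leaving the mixture of levels discussed in note 3.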

  3. Although algorithms using multi-dimensional or cell generalizations (e.g. the Mondrian algorithm, Lefevre et al. 2006) cause less information loss than algorithms using full-domain generalization, the former suffer from the data exploration problem (Fung et al. 2010): the co-existence of specific and generalized values in the generalized data set makes data exploration and interpretation difficult for the data analyst.

  4. On the legal side, different measures are adopted worldwide; see Pedreschi et al. (2013) for parallels between different measures and anti-discrimination acts.

  5. Discrimination occurs when a group is treated “less favorably” than others.

  6. Discrimination of a group occurs when a higher proportion of people not in the group is able to comply with a qualifying criterion.

  7. \(\alpha \) states an acceptable level of discrimination according to laws and regulations. For example, the U.S. Equal Pay Act (United States Congress 1963) states that “a selection rate for any race, sex, or ethnic group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded as evidence of adverse impact”. This amounts to using clift with \(\alpha =1.25\).
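The arithmetic behind the four-fifths rule quoted in this note can be checked with a toy computation. The function name and the selection counts below are illustrative assumptions; the threshold \(\alpha = 1.25\) is simply \(1/0.8\), i.e. adverse impact is flagged when the highest group's selection rate exceeds the protected group's rate by more than a factor of 1.25.

```python
def adverse_impact(rate_protected, rate_highest, alpha=1.25):
    """Flag adverse impact under the four-fifths rule: the protected
    group's selection rate is less than 4/5 of the highest rate,
    i.e. rate_highest / rate_protected > alpha (= 1/0.8)."""
    return rate_highest / rate_protected > alpha

# Toy figures: 30 of 100 protected applicants selected vs 50 of 100
# in the highest-rate group.
print(adverse_impact(30 / 100, 50 / 100))   # 0.5 / 0.3 ≈ 1.67 > 1.25
print(adverse_impact(45 / 100, 50 / 100))   # 0.5 / 0.45 ≈ 1.11, no impact
```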

  8. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:62011CJ0385:EN:Not

References

  • Aggarwal CC, Yu PS (eds) (2008) Privacy preserving data mining: models and algorithms. Springer, Berlin

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases, VLDB, pp 487–499

  • Agrawal R, Srikant R (2000) Privacy preserving data mining. In: ACM SIGMOD 2000, pp 439–450

  • Australian Legislation (2008) (a) Equal Opportunity Act—Victoria State, (b) Anti-Discrimination Act—Queensland State

  • Bache K, Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml. Accessed 20 Jan 2014

  • Bayardo RJ, Agrawal R (2005) Data privacy through optimal k-anonymization. In: ICDE 2005: IEEE, pp 217–228

  • Berendt B, Preibusch S (2012) Exploring discrimination: a user-centric evaluation of discrimination-aware data mining. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 344–351

  • Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Mining Knowl Discov 21(2):277–292

  • Custers B, Calders T, Schermer B, Zarsky TZ (eds) (2013) Discrimination and privacy in the information society—data mining and profiling in large databases. Studies in applied philosophy, epistemology and rational ethics, vol 3. Springer, Berlin

  • Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Mining Knowl Discov 11(2):195–212

  • Dwork C (2006) Differential privacy. In: ICALP 2006, LNCS 4052, Springer, pp 1–12

  • Dwork C (2011) A firm foundation for private data analysis. Commun ACM 54(1):86–95

  • Dwork C, Hardt M, Pitassi T, Reingold O, Zemel RS (2012) Fairness through awareness. In: ITCS 2012, ACM, pp 214–226

  • European Union Legislation (1995) Directive 95/46/EC

  • European Union Legislation (2009) (a) Race Equality Directive, 2000/43/EC, 2000; (b) Employment Equality Directive, 2000/78/EC, 2000; (c) Equal Treatment of Persons, European Parliament legislative resolution, P6\_TA(2009) 0211

  • Fung BCM, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: ICDE 2005, IEEE, pp 205–216

  • Fung BCM, Wang K, Fu AW-C, Yu P (2010) Introduction to privacy-preserving data publishing: concepts and techniques. Chapman & Hall/CRC, New York

  • Hajian S, Domingo-Ferrer J, Martínez-Ballesté A (2011) Rule protection for indirect discrimination prevention in data mining. In: MDAI 2011, LNCS 6820, Springer, pp 211–222

  • Hajian S, Domingo-Ferrer J (2013) A methodology for direct and indirect discrimination prevention in data mining. IEEE Trans Knowl Data Eng 25(7):1445–1459

  • Hajian S, Monreale A, Pedreschi D, Domingo-Ferrer J, Giannotti F (2012) Injecting discrimination and privacy awareness into pattern discovery. In: IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 360–369

  • Hajian S, Domingo-Ferrer J (2012) A study on the impact of data anonymization on anti-discrimination. In: 2012 IEEE 12th international conference on data mining workshops-ICDMW 2012, IEEE Computer Society, pp 352–359

  • Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte-Nordholt E, Spicer K, de Wolf P-P (2012) Statistical disclosure control. Wiley, Chichester

  • Iyengar VS (2002) Transforming data to satisfy privacy constraints. In: SIGKDD 2002, ACM, pp 279–288

  • Kamiran F, Calders T (2011) Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33(1):1–33

  • Kamiran F, Calders T, Pechenizkiy M (2010) Discrimination aware decision tree learning. In: ICDM 2010, IEEE, pp 869–874

  • Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: ECML/PKDD, LNCS 7524, Springer, pp 35–50

  • Lefevre K, Dewitt DJ, Ramakrishnan R (2005) Incognito: efficient full-domain k-anonymity. In SIGMOD 2005, ACM, pp 49–60

  • Lefevre K, Dewitt DJ, Ramakrishnan R (2006) Mondrian multidimensional k-anonymity. In: ICDE 2006, IEEE, p 25

  • Li N, Li T, Venkatasubramanian S (2007) \(t\)-Closeness: privacy beyond \(k\)-anonymity and \(l\)-diversity. In: IEEE ICDE 2007, IEEE, pp 106–115

  • Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: Bellare M (ed) Advances in cryptology-CRYPTO’00, LNCS 1880, Springer, Berlin, pp 36–53

  • Luong BT, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. In: KDD 2011, ACM, pp 502–510

  • Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) \(l\)-Diversity: privacy beyond \(k\)-anonymity. ACM Trans Knowl Discov Data (TKDD) 1(1):Article 3

  • Mohammed N, Chen R, Fung BCM, Yu PS (2011) Differentially private data release for data mining. In: KDD 2011, ACM, pp 493–501

  • Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: KDD 2008, ACM, pp 560–568

  • Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: SDM 2009, SIAM, pp 581–592

  • Pedreschi D, Ruggieri S, Turini F (2009) Integrating induction and deduction for finding evidence of discrimination. In: ICAIL 2009, ACM, pp 157–166

  • Pedreschi D, Ruggieri S, Turini F (2013) The discovery of discrimination. In: Custers BHM, Calders T, Schermer BW, Zarsky TZ (eds) Discrimination and privacy in the information society: studies in applied philosophy, epistemology and rational, ethics. Springer, Berlin, pp 91–108

  • Ruggieri S, Pedreschi D, Turini F (2010) Data mining for discrimination discovery. ACM Trans Knowl Discov Data (TKDD) 4(2):Article 9

  • Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027

  • Samarati P, Sweeney L (1998) Generalizing data to provide anonymity when disclosing information. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (PODS 98), Seattle, WA, p 188

  • Statistics Sweden (2001) Statistisk röjandekontroll av tabeller, databaser och kartor (Statistical disclosure control of tables, databases and maps, in Swedish). Statistics Sweden, Örebro. http://www.scb.se/statistik/_publikationer/OV9999_2000I02_BR_X97P0102. Accessed 20 Jan 2014

  • Sweeney L (1998) Datafly: a system for providing anonymity in medical data. In: Proceedings of the IFIP TC11 WG11.3 11th international conference on database security XI: status and prospects, pp 356–381

  • Sweeney L (2002) k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570

  • United States Congress (1963) US Equal Pay Act (EPA) (Pub. L. 88-38). http://www.eeoc.gov/eeoc/history/35th/thelaw/epa.html. Accessed 20 Jan 2014

  • Wang K, Yu PS, Chakraborty S (2004) Bottom-up generalization: a data mining solution to privacy protection. In: ICDM 2004, IEEE, pp 249–256

  • Willenborg L, de Waal T (1996) Elements of statistical disclosure control. Springer, Berlin

  • Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco

  • Zliobaite I, Kamiran F, Calders T (2011) Handling conditional discrimination. In: ICDM 2011, IEEE, pp 992–1001

Acknowledgments

The authors wish to thank Kristen LeFevre for providing the implementation of the Incognito algorithm and Guillem Rufian-Torrell for helping in the implementation of the algorithm proposed in this paper. This work was partly supported by the Government of Catalonia under Grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY”, TIN2012-32757 “ICWT” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES”, and by the European Commission under FP7 projects “DwB” and “INTER-TRUST”. The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.

Author information

Correspondence to Sara Hajian.

Additional information

Responsible editor: Guest Editors of PKDD 2014 (Dr. Toon Calders, Prof. Floriana Esposito, Prof. Eyke Hüllermeier and Dr. Rosa Meo).


Cite this article

Hajian, S., Domingo-Ferrer, J. & Farràs, O. Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Min Knowl Disc 28, 1158–1188 (2014). https://doi.org/10.1007/s10618-014-0346-1
