Knowledge and Information Systems

Volume 33, Issue 1, pp 1–33

Data preprocessing techniques for classification without discrimination

  • Faisal Kamiran
  • Toon Calders
Open Access
Regular Paper


Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination, e.g., with respect to sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy but whose predictions on test data do not exhibit this discrimination. This problem is relevant in many settings, such as when the data are generated by a biased decision process or when the sensitive attribute serves as a proxy for unobserved features. In this paper, we concentrate on the case with only one binary sensitive attribute and a two-class classification problem. We first study the theoretically optimal trade-off between accuracy and non-discrimination for pure classifiers. Then, we look at algorithmic solutions that preprocess the data to remove discrimination before a classifier is learned. We survey and extend our existing data preprocessing techniques: suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances. These preprocessing techniques have been implemented in a modified version of Weka, and we present the results of experiments on real-life data.
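To make the reweighing idea more concrete, the following is a minimal Python sketch and not the authors' Weka implementation. The abstract gives no formulas, so two assumptions are made here: discrimination is measured as the difference in positive-class rates between the two groups of the binary sensitive attribute, and each (group, label) combination receives the weight "count expected under independence divided by observed count". All function names and the toy data are hypothetical.

```python
from collections import Counter

# Each row is (s, c): s is the binary sensitive attribute (1 = deprived group,
# an assumed encoding), c is the binary class label (1 = positive class).

def discrimination(rows):
    """Assumed measure: P(c=1 | s=0) - P(c=1 | s=1)."""
    def pos_rate(group):
        labels = [c for s, c in rows if s == group]
        return sum(labels) / len(labels)
    return pos_rate(0) - pos_rate(1)

def reweigh(rows):
    """Weight per (s, c) pair: expected count under independence / observed count."""
    n = len(rows)
    s_count = Counter(s for s, _ in rows)
    c_count = Counter(c for _, c in rows)
    sc_count = Counter(rows)
    return {
        (s, c): (s_count[s] * c_count[c]) / (n * sc_count[(s, c)])
        for (s, c) in sc_count
    }

def weighted_discrimination(rows, weights):
    """Same measure as above, but every row counts with its weight."""
    def pos_rate(group):
        total = sum(weights[(s, c)] for s, c in rows if s == group)
        pos = sum(weights[(s, c)] for s, c in rows if s == group and c == 1)
        return pos / total
    return pos_rate(0) - pos_rate(1)

# Hypothetical toy data: the favored group (s=0) gets the positive label
# more often than the deprived group (s=1).
rows = [(0, 1)] * 6 + [(0, 0)] * 4 + [(1, 1)] * 2 + [(1, 0)] * 8
weights = reweigh(rows)
before = discrimination(rows)
after = weighted_discrimination(rows, weights)
```

On this toy data, the unweighted positive rates differ between the groups, while the weighted rates coincide, and no label is changed. A learner that accepts per-instance weights could then be trained on the weighted data; for learners that do not, the abstract mentions resampling as the alternative.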


Classification, Preprocessing, Discrimination-aware data mining



We thank the anonymous reviewers for their insightful comments and the many suggestions that contributed substantially to the improvement of the document.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.



Copyright information

© The Author(s) 2011

Authors and Affiliations

  1. Eindhoven, The Netherlands
  2. Eindhoven, The Netherlands
