Data preprocessing techniques for classification without discrimination
Regular Paper
First Online: 03 December 2011
Received: 23 November 2010; Revised: 23 August 2011; Accepted: 16 November 2011

Abstract
Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination, e.g., with respect to sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy but does not exhibit this discrimination in its predictions on test data. This problem is relevant in many settings, such as when the data are generated by a biased decision process or when the sensitive attribute serves as a proxy for unobserved features. In this paper, we concentrate on the case with only one binary sensitive attribute and a two-class classification problem. We first study the theoretically optimal trade-off between accuracy and non-discrimination for pure classifiers. Then, we look at algorithmic solutions that preprocess the data to remove discrimination before a classifier is learned. We survey and extend our existing data preprocessing techniques, namely suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances. These preprocessing techniques have been implemented in a modified version of Weka, and we present the results of experiments on real-life data.

Keywords: Classification, Preprocessing, Discrimination-aware data mining
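The abstract names reweighing but gives no formulas, so the following Python sketch is only an illustration under stated assumptions. It assumes discrimination is measured as the difference in positive-label rates between the favored and the deprived group, and that each (sensitive value, class) combination receives the weight "expected frequency under independence divided by observed frequency". The function names `discrimination` and `reweighing_weights` are hypothetical and are not taken from the paper or from Weka.

```python
from collections import Counter


def discrimination(sensitive, labels, deprived, positive, weights=None):
    """Positive rate of the favored group minus that of the deprived group.

    Assumption: this difference-of-rates measure is used for illustration;
    `weights` optionally maps (sensitive value, label) pairs to a weight.
    """
    pos = {True: 0.0, False: 0.0}    # weighted positives, keyed by "is deprived"
    total = {True: 0.0, False: 0.0}  # weighted group sizes
    for s, y in zip(sensitive, labels):
        w = 1.0 if weights is None else weights[(s, y)]
        is_deprived = (s == deprived)
        total[is_deprived] += w
        if y == positive:
            pos[is_deprived] += w
    return pos[False] / total[False] - pos[True] / total[True]


def reweighing_weights(sensitive, labels):
    """Weight per (sensitive value, label) pair: expected / observed count.

    Assumption: expected count is computed as if the sensitive attribute
    and the class label were statistically independent.
    """
    n = len(labels)
    s_count = Counter(sensitive)
    c_count = Counter(labels)
    sc_count = Counter(zip(sensitive, labels))
    return {
        (s, c): (s_count[s] * c_count[c]) / (n * sc_count[(s, c)])
        for (s, c) in sc_count
    }
```

On a toy dataset, passing the result of `reweighing_weights` as the `weights` argument of `discrimination` should bring the measured difference to zero, while no label is changed. This is what distinguishes reweighing from massaging, which relabels instances.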
This paper is an extended version of the papers [3, 13, 14].

References
Asuncion A, Newman D (2007) UCI machine learning repository
Attorney-General’s Department C (1984) Australian Sex Discrimination Act 1984. via.
Calders T, Kamiran F, Pechenizkiy M (2009) Building classifiers with independency constraints. In: IEEE ICDM workshop on domain driven data mining. IEEE press
Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Discov 21(2): 277–292
Chan PK, Stolfo SJ (1998) Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining, pp 164–168
Chao EL, Rones PL (2007) Women in the labor force: a databook. US Department of Labor and Bureau of Labor Statistics, Washington, DC
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16: 321–357
Chawla NV, Hall LO, Joshi A (2005) Wrapper-based computation and evaluation of sampling methods for imbalanced datasets
Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining, pp 155–164
Duivesteijn W, Feelders A (2008) Nearest neighbour classification with monotonicity constraints. In: Proceedings of ECML/PKDD European conference on machine learning and principles and practice of knowledge discovery in databases. Springer, pp 301–316
Dutch Central Bureau for Statistics (2001) Volkstelling.
Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of IJCAI international joint conference on artificial intelligence, pp 973–978
Kamiran F, Calders T (2009a) Classifying without discriminating. In: Proceedings of IEEE IC4 international conference on computer, control & communication. IEEE press
Kamiran F, Calders T (2009b) Discrimination-aware classification. In: BNAIC Benelux conference on artificial intelligence
Kamiran F, Calders T, Pechenizkiy M (2010) Constructing decision trees under non-discriminatory constraints. In: Proceedings of IEEE ICDM international conference on data mining. IEEE press
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2): 273–324
Koknar-Tezel S, Latecki L (2010) Improving SVM classification on imbalanced time series data sets with ghost points. Knowl Inf Syst 24(2): 1–23
Kotlowski W, Dembczynski K, Greco S, Slowinski R (2007) Statistical model for rough set approach to multicriteria classification. In: Proceedings of ECML/PKDD European conference on machine learning and principles and practice of knowledge discovery in databases. Springer
Luong B, Ruggieri S, Turini F (2011) k-NN as an implementation of situation testing for discrimination discovery and prevention. Technical Report TR-11-04, Dipartimento di Informatica, Università di Pisa
Margineantu D, Dietterich T (1999) Learning decision trees for loss minimization in multi-class problems. Technical report. Department of Computer Science, Oregon State University
Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining
Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: Proceedings of SIAM conference on data mining
Ruggieri S, Pedreschi D, Turini F (2010a) Dcube: discrimination discovery in databases. In: Proceedings of ACM SIGMOD international conference on management of data, pp 1127–1130
Ruggieri S, Pedreschi D, Turini F (2010b) Integrating induction and deduction for finding evidence of discrimination. Artif Intell Law, 1–43
The US Department of Justice U (2011) The US federal legislation. via.
Turner M, Skidmore F (1999) Mortgage lending discrimination: a review of existing evidence. Urban Institute Monograph Series on Race and Discrimination. Urban Institute Press
Turney P (2000) Cost-sensitive learning bibliography. Institute for Information Technology, National Research Council, Ottawa
US Department of Justice U (1974) US Equal Credit Opportunity Act. via.
US Empl. Opp. Comm. E (1963) US Equal Pay Act. via.
Wang B, Japkowicz N (2009) Boosting support vector machines for imbalanced data sets. Knowl Inf Syst, 1–20
Wang H, Wang S (2010) Mining incomplete survey data through classification. Knowl Inf Syst, 1–13