Data preprocessing techniques for classification without discrimination

Kamiran, Faisal; Calders, Toon

doi:10.1007/s10115-011-0463-8


Regular Paper
Open access
Published: 03 December 2011

Volume 33, pages 1–33, (2012)

Knowledge and Information Systems


Abstract

Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination, e.g., with respect to sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy but does not show this discrimination in its predictions on test data. This problem is relevant in many settings, such as when the data are generated by a biased decision process or when the sensitive attribute serves as a proxy for unobserved features. In this paper, we concentrate on the case with only one binary sensitive attribute and a two-class classification problem. We first study the theoretically optimal trade-off between accuracy and non-discrimination for pure classifiers. Then, we look at algorithmic solutions that preprocess the data to remove discrimination before a classifier is learned. We survey and extend our existing data preprocessing techniques, namely suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances. These preprocessing techniques have been implemented in a modified version of Weka, and we present the results of experiments on real-life data.
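The abstract names reweighing without spelling it out. The following is a minimal sketch under stated assumptions, not the authors' Weka implementation: it assumes discrimination is measured as the difference in positive-class rate between the favored and the deprived group, and that each instance is weighted by the ratio of the expected frequency of its (sensitive value, class) pair under independence to the observed frequency. The function names and the toy data are hypothetical.

```python
from collections import Counter

def discrimination(sensitive, labels, deprived, positive, weights=None):
    """Positive-class rate of the favored group minus that of the deprived group.

    Assumed measure; optional per-instance weights give the weighted version.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    pos = {True: 0.0, False: 0.0}    # weighted positives, keyed by "is deprived"
    total = {True: 0.0, False: 0.0}  # weighted group sizes
    for s, y, w in zip(sensitive, labels, weights):
        is_dep = (s == deprived)
        total[is_dep] += w
        if y == positive:
            pos[is_dep] += w
    return pos[False] / total[False] - pos[True] / total[True]

def reweigh(sensitive, labels):
    """Weight per instance: expected count under independence / observed count."""
    n = len(labels)
    count_s = Counter(sensitive)
    count_c = Counter(labels)
    count_sc = Counter(zip(sensitive, labels))
    return [
        count_s[s] * count_c[c] / (n * count_sc[(s, c)])
        for s, c in zip(sensitive, labels)
    ]

# Hypothetical toy data: 5 instances per group, 2/5 positives for "f", 4/5 for "m".
sensitive = ["f"] * 5 + ["m"] * 5
labels = ["+", "+", "-", "-", "-", "+", "+", "+", "+", "-"]

before = discrimination(sensitive, labels, deprived="f", positive="+")
weights = reweigh(sensitive, labels)
after = discrimination(sensitive, labels, deprived="f", positive="+", weights=weights)
print(before, after)
```

On this toy data the labels are left unchanged; only the instance weights differ, so a learner that accepts weighted instances sees both groups with the same weighted positive rate.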


References

Asuncion A, Newman D (2007) UCI machine learning repository
Attorney-General’s Department C (1984) Australian sex discrimination act 1984. via. http://www.comlaw.gov.au/Details/C2010C00056
Calders T, Kamiran F, Pechenizkiy M (2009) Building classifiers with independency constraints. In: IEEE ICDM workshop on domain driven data mining. IEEE press
Calders T, Verwer S (2010) Three naive bayes approaches for discrimination-free classification. Data Min Knowl Discov 21(2): 277–292
Chan PK, Stolfo SJ (1998) Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining, pp 164–168
Chao EL, Rones PL (2007) Women in the labor force: a databook. US Department of Labor and Bureau of Labor Statistics, Washington, DC
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16: 321–357
Chawla NV, Hall LO, Joshi A (2005) Wrapper-based computation and evaluation of sampling methods for imbalanced datasets
Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining, pp 155–164
Duivesteijn W, Feelders A (2008) Nearest neighbour classification with monotonicity constraints. In: Proceedings of ECML/PKDD European conference on machine learning and principles and practice of knowledge discovery in databases. Springer, pp 301–316
Dutch Central Bureau for Statistics (2001) Volkstelling. http://easy.dans.knaw.nl/dms
Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of IJCAI international joint conference on artificial intelligence, pp 973–978
Kamiran F, Calders T (2009a) Classifying without discriminating. In: Proceedings of IEEE IC4 international conference on computer, control and communication. IEEE press
Kamiran F, Calders T (2009b) Discrimination-aware classification. In: BNAIC Benelux conference on artificial intelligence
Kamiran F, Calders T, Pechenizkiy M (2010) Constructing decision trees under non-discriminatory constraints. In: Proceedings of IEEE ICDM international conference on data mining. IEEE press
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2): 273–324
Koknar-Tezel S, Latecki L (2010) Improving SVM classification on imbalanced time series data sets with ghost points. Knowl Inf Syst 24(2): 1–23
Kotlowski W, Dembczynski K, Greco S, Slowinski R (2007) Statistical model for rough set approach to multicriteria classification. In: Proceedings of ECML/PKDD European conference on machine learning and principles and practice of knowledge discovery in databases. Springer
Luong B, Ruggieri S, Turini F (2011) k-nn as an implementation of situation testing for discrimination discovery and prevention. Technical Report TR-11-04, Dipartimento di Informatica, Universita di Pisa
Margineantu D, Dietterich T (1999) Learning decision trees for loss minimization in multi-class problems. Technical report. Department of Computer Science, Oregon State University
Pedreschi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: Proceedings of ACM SIGKDD conference on knowledge discovery and data mining
Pedreschi D, Ruggieri S, Turini F (2009) Measuring discrimination in socially-sensitive decision records. In: Proceedings of SIAM conference on data mining
Ruggieri S, Pedreschi D, Turini F (2010a) Dcube: discrimination discovery in databases. In: Proceedings of ACM SIGMOD international conference on management of data, pp 1127–1130
Ruggieri S, Pedreschi D, Turini F (2010b) Integrating induction and deduction for finding evidence of discrimination. Artif Intell Law, 1–43
The European Court of Justice E (2011) The European court of justice ruling. via. http://ec.europa.eu/ireland/press_office/news_of_the_day/ecj-ruling-sex-discrimination-in-insurance-contracts_en.htm
The US Department of Justice U (2011) The US federal legislation. via. http://www.justice.gov/crt
Turner M, Skidmore F (1999) Mortgage lending discrimination: a review of existing evidence. Urban Institute Monograph Series on Race and Discrimination. Urban Institute Press
Turney P (2000) Cost-sensitive learning bibliography. Institute for Information Technology, National Research Council, Ottawa
US Department of Justice U (1974) US equal credit opportunity act. via. http://www.fdic.gov/regulations/laws/rules/6500-1200.html
US Empl. Opp. Comm. E (1963) US equal pay act. via. http://www.eeoc.gov/laws/statutes/epa.cfm
Wang B, Japkowicz N (2009) Boosting support vector machines for imbalanced data sets. Knowl Inf Syst, 1–20
Wang H, Wang S (2010) Mining incomplete survey data through classification. Knowl Inf Syst, 1–13


Acknowledgments

We thank the anonymous reviewers for their insightful comments and the many suggestions that contributed substantially to the improvement of the document.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

HG 7.46, P.O. Box 513, 5600 MB, Eindhoven, The Netherlands
Faisal Kamiran
HG 7.82a, P.O. Box 513, 5600 MB, Eindhoven, The Netherlands
Toon Calders

Corresponding author

Correspondence to Faisal Kamiran.

Additional information

This paper is an extended version of the papers [3, 13, 14].

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


About this article

Cite this article

Kamiran, F., Calders, T. Data preprocessing techniques for classification without discrimination. Knowl Inf Syst 33, 1–33 (2012). https://doi.org/10.1007/s10115-011-0463-8


Received: 23 November 2010
Revised: 23 August 2011
Accepted: 16 November 2011
Published: 03 December 2011
Issue Date: October 2012
DOI: https://doi.org/10.1007/s10115-011-0463-8
