
Accuracy and fairness trade-offs in machine learning: a stochastic multi-objective approach

Abstract

In the application of machine learning to real-life decision-making systems, e.g., credit scoring and criminal justice, the prediction outcomes might discriminate against people with sensitive attributes, leading to unfairness. The commonly used strategy in fair machine learning is to include fairness as a constraint or a penalization term in the minimization of the prediction loss, which ultimately limits the information given to decision-makers. In this paper, we introduce a new approach for handling fairness by formulating a stochastic multi-objective optimization problem for which the corresponding Pareto fronts uniquely and comprehensively define the accuracy-fairness trade-offs. We then apply a stochastic approximation-type method to efficiently compute well-spread and accurate Pareto fronts, which also allows us to handle training data arriving in a streaming way.



Notes

  1. Our implementation code is available at https://github.com/sul217/MOO_Fairness.

References

  • Alexandropoulos S-AN, Aridas CK, Kotsiantis SB, Vrahatis MN (2019) Multi-objective evolutionary optimization algorithms for machine learning: a recent survey. In: Approximation and optimization, Springer, pp 35–55

  • Barocas S, Selbst AD (2016) Big data’s disparate impact. California Law Rev 104:671


  • Bi J (2003) Multi-objective programming in SVMs. In: Proceedings of the 20th international conference on machine learning, pp 35–42

  • Braga AP, Takahashi RH, Costa MA, de Albuquerque Teixeira R (2006) Multi-objective algorithms for neural networks learning. In: Multi-objective machine learning, Springer, pp 151–171

  • Calders T, Verwer S (2010) Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Discov 21(2):277–292


  • Calders T, Kamiran F, Pechenizkiy M (2009) Building classifiers with independency constraints. In: 2009 IEEE international conference on data mining workshops, IEEE, pp 13–18

  • Calmon F, Wei D, Vinzamuri B, Ramamurthy KN, Varshney KR (2017) Optimized pre-processing for discrimination prevention. In: Advances in Neural Information Processing Systems, pp 3992–4001

  • Custódio ALL, Madeira JA, Vaz AIF, Vicente LN (2011) Direct multisearch for multiobjective optimization. SIAM J Optim 21(3):1109–1140


  • Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evolut Comput 6(2):182–197


  • Dolan ED, Moré JJ (2002) Benchmarking optimization software with performance profiles. Math Program 91(2):201–213


  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Fliege J, Svaiter BF (2000) Steepest descent methods for multicriteria optimization. Math Methods Oper Res 51(3):479–494


  • Fliege J, Vaz AIF, Vicente LN (2019) Complexity of gradient descent for multiobjective optimization. Optim Methods Softw 34(5):949–959


  • Fonseca CM, Paquete L, López-Ibánez M (2006) An improved dimension-sweep algorithm for the hypervolume indicator. In: 2006 IEEE international conference on evolutionary computation, IEEE, pp 1157–1163

  • Haimes YV (1971) On a bicriterion formulation of the problems of integrated system identification and system optimization. IEEE Trans Syst Man Cybern 1(3):296–297


  • Handl J, Knowles J (2004) Evolutionary multiobjective clustering. In: International conference on parallel problem solving from nature, Springer, pp 1081–1091

  • Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. In: Advances in neural information processing systems, pp 3315–3323

  • Igel C (2005) Multi-objective model selection for support vector machines. In: International conference on evolutionary multi-criterion optimization, Springer, pp 534–546

  • Jin Y (2006) Multi-objective machine learning, vol 16. Springer Science & Business Media, Berlin


  • Jin Y, Sendhoff B (2008) Pareto-based multiobjective machine learning: an overview and case studies. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(3):397–415


  • Kamishima T, Akaho S, Asoh H, Sakuma J (2012) Fairness-aware classifier with prejudice remover regularizer. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, pp 35–50

  • Kamishima T, Akaho S, Sakuma J (2011) Fairness-aware learning through regularization approach. In: 2011 IEEE 11th international conference on data mining workshops, IEEE, pp 643–650

  • Kaoutar S, Mohamed E (2017) Multi-criteria optimization of neural networks using multi-objective genetic algorithm. In: 2017 Intelligent systems and computer vision (ISCV), IEEE, pp 1–4

  • Kelly J (2020) Women now hold more jobs than men in the U.S. workforce. https://www.forbes.com/sites/jackkelly/2020/01/13/women-now-hold-more-jobs-than-men

  • Kim D (2004) Structural risk minimization on decision trees using an evolutionary multiobjective optimization. In: European conference on genetic programming, Springer, pp 338–348

  • Kohavi R (1996) Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proceedings of the second international conference on knowledge discovery and data mining, KDD'96, AAAI Press, pp 202–207

  • Kokshenev I, Braga AP (2008) A multi-objective learning algorithm for RBF neural network. In: 2008 10th Brazilian symposium on neural networks, IEEE, pp 9–14

  • Kraft D (1998) A software package for sequential quadratic programming. Forschungsbericht, Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt

  • Larson J, Mattu S, Kirchner L, Angwin J (2016a) How we analyzed the COMPAS recidivism algorithm. ProPublica

  • Larson J, Mattu S, Kirchner L, Angwin J (2016b) ProPublica COMPAS dataset. https://github.com/propublica/compas-analysis

  • Law MH, Topchy AP, Jain AK (2004) Multiobjective data clustering. In: Proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition (CVPR 2004), vol 2, IEEE, pp 424–430

  • Liu S, Vicente LN (2021) The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. Ann Oper Res. https://doi.org/10.1007/s10479-021-04033-z

  • Mercier Q, Poirion F, Désidéri J-A (2018) A stochastic multiple gradient descent algorithm. Eur J Oper Res 271(3):808–817


  • Munoz C, Smith M, Patil D (2016) Big data: a report on algorithmic systems, opportunity, and civil rights. Executive Office of the President

  • Navon A, Shamsian A, Chechik G, Fetaya E (2021) Learning the Pareto front with hypernetworks. In: International conference on learning representations

  • Pedreshi D, Ruggieri S, Turini F (2008) Discrimination-aware data mining. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 560–568

  • Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger KQ (2017) On fairness and calibration. In: Advances in neural information processing systems, pp 5680–5689

  • Podesta J, Pritzker P, Moniz EJ, Holdren J, Zients J (2014) Big data: seizing opportunities, preserving values. Technical Report, Executive Office of the President

  • Reiners M, Klamroth K, Stiglmayr M (2020) Efficient and sparse neural networks by pruning weights in a multiobjective learning approach. preprint arXiv:2008.13590

  • Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. In: Proceedings of the 32nd international conference on neural information processing systems, pp 525–536

  • Senhaji K, Ramchoun H, Ettaouil M (2020) Training feedforward neural network via multiobjective optimization model using non-smooth L1/2 regularization. Neurocomputing 410:1–11


  • Senhaji K, Ramchoun H, Ettaouil M (2017) Multilayer perceptron: NSGA II for a new multi-objective learning method for training and model complexity. In: First international conference on real time intelligent systems, Springer, pp 154–167

  • Varghese NV, Mahmoud QH (2020) A survey of multi-task deep reinforcement learning. Electronics 9:1363


  • Williamson RC, Menon AK (2019) Fairness risk measures. In: International conference on machine learning, pp 6786–6797

  • Woodworth B, Gunasekar S, Ohannessian MI, Srebro N (2017) Learning non-discriminatory predictors. In: Conference on Learning Theory, pp 1920–1953

  • Yusiong JPT, Naval PC (2006) Training neural networks using multiobjective particle swarm optimization. In: International conference on natural computation, Springer, pp 879–888

  • Zafar MB, Valera I, Rodriguez MG, Gummadi KP (2017a) Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: Proceedings of the 26th international conference on world wide web, International World Wide Web Conferences Steering Committee, pp 1171–1180

  • Zafar MB, Valera I, Rodriguez MG, Gummadi KP (2017b) Fairness constraints: mechanisms for fair classification. In: Artificial intelligence and statistics, pp 962–970

  • Zemel R, Wu Y, Swersky K, Pitassi T, Dwork C (2013) Learning fair representations. In: International conference on machine learning, pp 325–333

  • Zhang Y, Yang Q (2021) A survey on multi-task learning. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2021.3070203

  • Zitzler E, Thiele L (1999) Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans Evolut Comput 3:257–271


Author information


Correspondence to Suyun Liu or Luis Nunes Vicente.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

L. N. Vicente: Support for this author was partially provided by the Centre for Mathematics of the University of Coimbra under grant FCT/MCTES UIDB/MAT/00324/2020.

Appendices

A. The stochastic multi-gradient (SMG) algorithm

[Algorithm 1: the stochastic multi-gradient (SMG) algorithm]
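
The core step of SMG is easy to express in code. Below is a minimal Python sketch of one SMG iteration for m = 2 objectives, using the closed-form minimum-norm convex combination of the two stochastic gradients; the function names are illustrative assumptions, and the authors' actual implementation is available in the repository of Note 1.

```python
import numpy as np

def common_descent_direction(g1, g2):
    """Min-norm convex combination of two (stochastic) gradients.

    Closed-form solution of min_{lam in [0,1]} ||lam*g1 + (1-lam)*g2||^2,
    the multi-gradient subproblem for m = 2 objectives.
    """
    diff = g1 - g2
    denom = diff.dot(diff)
    if denom == 0.0:                      # the two gradients coincide
        return -g1
    lam = np.clip((g2 - g1).dot(g2) / denom, 0.0, 1.0)
    return -(lam * g1 + (1.0 - lam) * g2)

def smg_step(x, stoch_grad_f1, stoch_grad_f2, alpha):
    """One stochastic multi-gradient step with step size alpha."""
    d = common_descent_direction(stoch_grad_f1(x), stoch_grad_f2(x))
    return x + alpha * d
```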

B. Description and illustration of the Pareto-front stochastic multi-gradient algorithm

A formal description of the PF-SMG algorithm is given in Algorithm 2.

[Algorithm 2: the Pareto-front stochastic multi-gradient (PF-SMG) algorithm]

An illustration is provided in Fig. 6. The blue curve represents the true Pareto front. The PF-SMG algorithm first randomly generates a list of feasible starting points (the blue points in Fig. 6a). For each point in the current list, a certain number of perturbed points (the green circles in Fig. 6a) are added to the list, after which multiple runs of the SMG algorithm are applied to each point in the current list. The newly generated points are marked by red circles in Fig. 6b. At the end of the current iteration, the list for the next iteration is obtained by removing all dominated points. As the algorithm proceeds, the front moves towards the true Pareto front.

Fig. 6: Illustration of the Pareto-front stochastic multi-gradient algorithm

The complexity rates for determining a point in the Pareto front using the stochastic multi-gradient method are reported in Liu and Vicente (2021). However, as far as we know, there are no convergence or complexity results in multiobjective optimization for determining the whole Pareto front (under reasonable assumptions that do not reduce to evaluating the objective functions on a set that is dense in the decision space).

C. Metrics for Pareto front comparison

Let \({\mathcal {A}}\) denote the set of algorithms/solvers and \({\mathcal {T}}\) denote the set of test problems. The Purity metric measures the accuracy of an approximated Pareto front. Let \(F({\mathcal {P}}_{a, t})\) denote the approximated Pareto front of problem t computed by algorithm a. We approximate the "true" Pareto front \(F({\mathcal {P}}_t)\) of problem t by the set of all nondominated points in \(\cup _{a \in {\mathcal {A}}} F({\mathcal {P}}_{a, t})\). Then, the Purity of a Pareto front computed by algorithm a for problem t is the ratio \(r_{a, t} = |F({\mathcal {P}}_{a, t}) \cap F({\mathcal {P}}_t)|/|F({\mathcal {P}}_{a, t})| \in [0, 1]\), which gives the percentage of "true" nondominated points among all the nondominated points generated by algorithm a. A higher ratio corresponds to a more accurate Pareto front.
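
As an illustration, the Purity ratio can be computed as in the following sketch (a hypothetical setup in which each front is a 2-D array of objective vectors, one row per point, under componentwise minimization).

```python
import numpy as np

def nondominated_mask(V):
    """Boolean mask of the nondominated rows of V (minimization)."""
    n = V.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(V, i, axis=0)
        mask[i] = not np.any(np.all(others <= V[i], axis=1) &
                             np.any(others < V[i], axis=1))
    return mask

def purity(front_a, fronts_all):
    """|F(P_{a,t}) ∩ F(P_t)| / |F(P_{a,t})|, where F(P_t) is approximated
    by the nondominated points of the union of all solvers' fronts."""
    union = np.vstack(fronts_all)
    ref = union[nondominated_mask(union)]
    ref_set = {tuple(v) for v in ref}
    hits = sum(tuple(v) in ref_set for v in front_a)
    return hits / len(front_a)
```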

The Spread metric is designed to measure the extent of the point spread in a computed Pareto front, which requires computing extreme points in the objective function space \({\mathbb {R}}^m\). Among the m objective functions, we select as the pair of extreme points a pair of nondominated points with the highest pairwise distance (measured using some \(f_i\)). More specifically, for a particular algorithm a, let \(x_{\min }^i = {{\,\mathrm{argmin}\,}}_{x \in {\mathcal {P}}_{a, t}} f_i(x)\) and \(x_{\max }^i = {{\,\mathrm{argmax}\,}}_{x \in {\mathcal {P}}_{a, t}} f_i(x)\). Then, the pair of extreme points is \((x_{\min }^k, x_{\max }^k)\) with \(k = {{\,\mathrm{argmax}\,}}_{i = 1, \ldots , m} \left( f_i(x_{\max }^i) - f_i(x_{\min }^i)\right) \).

The first Spread formula calculates the maximum size of the holes in a Pareto front. Assume algorithm a generates an approximated Pareto front with M points, indexed by \(1, \ldots , M\), to which the extreme points \(F(x_{\min }^k)\) and \(F(x_{\max }^k)\), indexed by 0 and \(M+1\), are added. Denote the maximum size of the holes by \(\Gamma \). We have

$$\begin{aligned} \Gamma \;=\; \Gamma _{a, t} \;=\; \max _{i \in \{1, \ldots , m\}} \left( \max _{j \in \{0, \ldots , M\}}\{\delta _{i,j}\}\right) , \end{aligned}$$

where \(\delta _{i,j} = f_{i,j + 1} - f_{i, j}\) and the values of each objective function \(f_i\) are assumed to be sorted in increasing order.

The second formula was proposed by Deb et al. (2002) for the case \(m = 2\) (and further extended to the case \(m \ge 2\) in Custódio et al. (2011)) and indicates how well the points are distributed in a Pareto front. Denote the point spread by \(\Delta \). It is computed by the following formula:

$$\begin{aligned} \Delta \;=\; \Delta _{a, t} \;=\; \max _{i \in \{1, \ldots , m\}} \left( \frac{\delta _{i, 0} + \delta _{i, M} + \sum _{j = 1}^{M-1}|\delta _{i, j} - {\bar{\delta }}_i|}{\delta _{i, 0} + \delta _{i, M} + (M-1){\bar{\delta }}_i} \right) , \end{aligned}$$

where \({\bar{\delta }}_i\), \(i = 1, \ldots , m\), is the average of \(\delta _{i, j}\) over \(j = 1, \ldots , M - 1\). Note that the lower \(\Gamma \) and \(\Delta \) are, the better distributed the Pareto front is.
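
The two formulas transcribe directly into code. The sketch below takes an \((M, m)\) array of objective vectors plus the two extreme points; whether the extreme points come from the algorithm's own front or from a combined reference front is left to the caller, and the helper for selecting them follows the rule stated above.

```python
import numpy as np

def extreme_points(V):
    """Pair of extreme points of a front V ((N, m) array): the argmin and
    argmax rows of the objective k attaining the largest range."""
    k = np.argmax(V.max(axis=0) - V.min(axis=0))
    return V[np.argmin(V[:, k])], V[np.argmax(V[:, k])]

def gamma_delta(front, lo, hi):
    """Γ (max hole) and Δ (spread) of an (M, m) front; the extreme points
    lo, hi play the roles of indices 0 and M+1. Assumes M >= 2 so that the
    interior average \\bar{δ}_i is well defined."""
    W = np.vstack([front, lo, hi]).astype(float)
    gammas, deltas = [], []
    for i in range(W.shape[1]):
        d = np.diff(np.sort(W[:, i]))      # holes δ_{i,0}, ..., δ_{i,M}
        gammas.append(d.max())             # Γ contribution of f_i
        d0, dM, interior = d[0], d[-1], d[1:-1]
        dbar = interior.mean()             # \bar{δ}_i over j = 1..M-1
        deltas.append((d0 + dM + np.abs(interior - dbar).sum())
                      / (d0 + dM + len(interior) * dbar))
    return max(gammas), max(deltas)
```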

Fig. 7: Illustration of hypervolume using a bi-objective example (Fonseca et al. 2006)

Hypervolume (Zitzler and Thiele 1999) is another classical performance indicator, accounting for both the quality of the individual Pareto points and their overall coverage of the objective space. It calculates the area/volume dominated by a given set of nondominated solutions with respect to a reference point. Figure 7 illustrates a bi-objective case in which the area dominated by a set of points \(\{p^{(1)}, p^{(2)}, p^{(3)}\}\) with respect to the reference point r is shown in grey. In our experiments, we compute the hypervolume using the Pymoo package (see https://pymoo.org/misc/indicators.html).
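
For instance, with a recent Pymoo version (0.6 or later, where the indicator is exposed as pymoo.indicators.hv.HV; earlier versions exposed it through a factory function), the computation for a setting like Fig. 7 looks as follows. The point coordinates and reference point are illustrative, not those of the figure.

```python
import numpy as np
from pymoo.indicators.hv import HV

# Three nondominated points p^(1), p^(2), p^(3) in objective space
F = np.array([[0.2, 0.8],
              [0.5, 0.5],
              [0.8, 0.2]])
r = np.array([1.0, 1.0])   # reference point r

hv = HV(ref_point=r)
print(hv(F))               # area dominated by the three points w.r.t. r
```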

D. Datasets generation and pre-processing

The synthetic data consists of 20 sets of 2,000 binary classification instances randomly generated from the distribution setting specified in Zafar et al. (2017b, Section 4): a uniform distribution generates the binary labels Y, two different Gaussian distributions generate the 2-dimensional nonsensitive features Z, and a Bernoulli distribution generates the binary sensitive attribute A.
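
For illustration, the generation process can be sketched as follows. The means, covariances, and Bernoulli parameter below are placeholders rather than the exact values of Zafar et al. (2017b), where, in particular, the Bernoulli parameter of the sensitive attribute depends on (a rotation of) the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Binary labels Y from a uniform distribution
y = rng.integers(0, 2, size=n)

# 2-D nonsensitive features Z from two label-dependent Gaussians
# (placeholder parameters; see Zafar et al. (2017b) for the actual ones)
mean = {0: np.array([-2.0, -2.0]), 1: np.array([2.0, 2.0])}
cov = {0: np.array([[10.0, 1.0], [1.0, 3.0]]),
       1: np.array([[5.0, 1.0], [1.0, 5.0]])}
Z = np.array([rng.multivariate_normal(mean[yi], cov[yi]) for yi in y])

# Binary sensitive attribute A from a Bernoulli distribution (constant
# parameter here; feature-dependent in Zafar et al.)
A = rng.binomial(1, 0.5, size=n)
```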

The data pre-processing details for the Adult Income dataset are given below.

  1. First, we combine all instances in adult.data and adult.test and remove those with missing attribute values.

  2. We consider the following list of features: Age, Workclass, Education, Education number, Marital Status, Occupation, Relationship, Race, Sex, Capital gain, Capital loss, Hours per week, and Country. As Zafar et al. (2017a) did for the attribute Country, we reduce its dimension by merging all non-United-States countries into one group. Similarly for the feature Education, where "Preschool", "1st-4th", "5th-6th", and "7th-8th" are merged into one group, and "9th", "10th", "11th", and "12th" into another.

  3. Last, we one-hot encode all categorical attributes and normalize the continuous-valued attributes. A pandas sketch of these three steps is given after this list.
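
The sketch below uses pandas. The file-reading details (the extra header line in adult.test, the "?" missing-value marker) and the merged-category names ("Elementary", "HS-nograd") are illustrative assumptions rather than the authors' exact choices.

```python
import pandas as pd

names = ["Age", "Workclass", "fnlwgt", "Education", "Education number",
         "Marital Status", "Occupation", "Relationship", "Race", "Sex",
         "Capital gain", "Capital loss", "Hours per week", "Country",
         "Income"]

# Step 1: combine both files and drop rows with missing values ("?")
df = pd.concat([
    pd.read_csv("adult.data", header=None, names=names,
                na_values="?", skipinitialspace=True),
    pd.read_csv("adult.test", header=None, names=names, skiprows=1,
                na_values="?", skipinitialspace=True),
], ignore_index=True).dropna()

# Step 2: merge sparse categories of Country and Education
df["Country"] = df["Country"].where(df["Country"] == "United-States",
                                    "Other")
grp1 = ["Preschool", "1st-4th", "5th-6th", "7th-8th"]
grp2 = ["9th", "10th", "11th", "12th"]
df["Education"] = df["Education"].replace(
    {**dict.fromkeys(grp1, "Elementary"), **dict.fromkeys(grp2, "HS-nograd")})

# Step 3: one-hot encode categoricals, normalize continuous attributes
# (fnlwgt is not in the feature list above, so it is dropped here)
cont = ["Age", "Education number", "Capital gain", "Capital loss",
        "Hours per week"]
cat = ["Workclass", "Education", "Marital Status", "Occupation",
       "Relationship", "Race", "Sex", "Country"]
X = pd.get_dummies(df[cat]).join(
    (df[cont] - df[cont].mean()) / df[cont].std())
```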

Table 1 Adult Income dataset: Gender
Table 2 Adult Income dataset: Race
Table 3 COMPAS dataset: Race

In terms of gender, the dataset contains \(67.5\%\) males (\(31.3\%\) high income) and \(32.5\%\) females (\(11.4\%\) high income); see Table 1. Similarly, the demographic composition in terms of race is \(2.88\%\) Asian (\(28.3\%\)), \(0.96\%\) American-Indian (\(12.2\%\)), \(86.03\%\) White (\(26.2\%\)), \(9.35\%\) Black (\(1.2\%\)), and \(0.78\%\) Other (\(12.7\%\)), where the numbers in brackets are the percentages of high-income instances; see Table 2. Table 3 gives the analogous race composition for the COMPAS dataset.

E. More numerical results

E.1 Disparate impact w.r.t. binary sensitive attribute

See Fig. 8.

Fig. 8: Pareto front comparison for the Adult Income dataset w.r.t. gender. Parameters used in PF-SMG: \(p_1=2\), \(p_2 = 3\), \(\alpha _0 = 2.1\) (multiplied by 1/3 every 500 iterates of SMG), \(b_{1, k} = 80\times 1.018^k\), and \(b_{2, k} = 50\times 1.018^k\)

E.2 Disparate impact w.r.t. multi-valued sensitive attribute

See Fig. 9.

Fig. 9: Pareto front comparison for the Adult Income dataset w.r.t. race. Parameters used in PF-SMG: \(p_1 = 3\), \(p_2 = 2\), \(\alpha _0 = 3.0\) (multiplied by 1/3 every 100 iterates of SMG), \(b_{1, k} = 50\times 1.012^k\), and \(b_{2, k} = 30\times 1.012^k\)

E.3 Streaming data

See Fig. 10.

Fig. 10: Updating Pareto fronts using streaming data


About this article


Cite this article

Liu, S., Vicente, L.N. Accuracy and fairness trade-offs in machine learning: a stochastic multi-objective approach. Comput Manag Sci 19, 513–537 (2022). https://doi.org/10.1007/s10287-022-00425-z
