Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias

Part of the Lecture Notes in Computer Science book series (LNAI, volume 13281)


Machine learning can help overcome human biases in decision making by focusing on purely logical conclusions drawn from the training data. If the training data is biased, however, that bias is transferred to the model and remains undetected, since performance is validated on a test set drawn from the same biased distribution. Existing strategies for selection-bias identification and mitigation generally rely on some knowledge of the bias or the ground truth. An exception is the Imitate algorithm, which assumes no such knowledge but comes with a strong limitation: it can only model datasets with one normally distributed cluster per class. In this paper, we introduce a novel algorithm, Mimic, which uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique can model multi-cluster datasets and provide solutions for a substantially wider set of problems. Experiments confirm that Mimic not only identifies potential biases in multi-cluster datasets, which can then be corrected early on, but also improves classifier performance.
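The mixture-modeling building block described in the abstract can be conveyed with a minimal sketch. This is illustrative only and not the authors' Mimic implementation: it uses scikit-learn's `GaussianMixture` on synthetic data to show how a class with two Gaussian clusters, one of them truncated by biased sampling, is captured by a fitted mixture whose component weights expose the imbalance.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch only (not the Mimic algorithm itself): model a class
# as a mixture of multivariate Gaussians and inspect the fit for signs of
# selection bias, e.g. a cluster whose weight is far smaller than expected.
rng = np.random.default_rng(0)

# A two-cluster class; selection bias then discards the upper half of the
# second cluster (points with coordinate sum above 12).
full = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), 500),
    rng.multivariate_normal([6.0, 6.0], np.eye(2), 500),
])
biased = full[full[:, 0] + full[:, 1] < 12]

# Fit one Gaussian per cluster; the mixture weights reveal the imbalance
# introduced by the biased sampling (roughly 2:1 instead of 1:1).
gmm = GaussianMixture(n_components=2, random_state=0).fit(biased)

# Sampling from the fitted mixture yields candidate points for regions the
# biased sample under-represents -- the spirit of bias mitigation.
synthetic, _ = gmm.sample(1000)
```

In Mimic itself, of course, neither the number of clusters nor the bias is known in advance; the sketch only shows the mixture-of-Gaussians modeling step that distinguishes Mimic from the single-Gaussian assumption of Imitate.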



  1. Implementation and Supplementary Material:

  2. The Central Limit Theorem states that a suitably standardized sum of independent and identically distributed (i.i.d.) random variables converges in distribution to a Gaussian; versions of the theorem also hold for dependent variables [10]. Since real-world measurements typically arise as combinations of many small effects rather than perfectly i.i.d. quantities, approximately Gaussian clusters are common in practice.
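The footnote's claim can be checked numerically. The following small simulation (illustrative only, with arbitrarily chosen sample sizes) standardizes sums of uniform, clearly non-Gaussian variables and recovers the standard normal's one-sigma mass:

```python
import numpy as np

# Numerical illustration of the Central Limit Theorem: standardized sums of
# uniform (clearly non-Gaussian) i.i.d. variables behave like a standard normal.
rng = np.random.default_rng(42)
n, reps = 200, 10_000
sums = rng.uniform(0.0, 1.0, size=(reps, n)).sum(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sum has mean n/2
# and variance n/12; standardize accordingly.
z = (sums - n * 0.5) / np.sqrt(n / 12.0)

# For a standard normal, ~68.3% of the mass lies within one standard deviation.
frac_within_one_sd = float(np.mean(np.abs(z) < 1))
```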


  1. Abreu, N.: Análise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em marketing, ISCTE-IUL, Lisbon (2011)

  2. Bareinboim, E., Tian, J., Pearl, J.: Recovering from selection bias in causal and statistical inference. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence (2014)

  3. Bellamy, R.K.E., et al.: AI Fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Develop. 63(4/5), 4:1–4:15 (2019)

  4. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000)

  5. Dost, K., Taskova, K., Riddle, P., Wicker, J.: Your best guess when you know nothing: identification and mitigation of selection bias. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996–1001. IEEE (2020)

  6. Dua, D., Graff, C.: UCI Machine Learning Repository (2017)

  7. Goel, N., Yaghini, M., Faltings, B.: Non-discriminatory machine learning through convex fairness criteria. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (2018)

  8. Granichin, O., Volkovich, Z.V., Toledano-Kitai, D.: Cluster validation. In: Randomized Algorithms in Automatic Control and Data Mining. ISRL, vol. 67, pp. 163–228. Springer, Heidelberg (2015)

  9. Hassani, B.K.: Societal bias reinforcement through machine learning: a credit scoring perspective. AI Ethics 1(3), 239–247 (2020)

  10. Hoeffding, W., Robbins, H.: The central limit theorem for dependent random variables. Duke Math. J. 15(3), 773–780 (1948)

  11. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Netw. 13(4), 411–430 (2000)

  12. Lavalle, A., Maté, A., Trujillo, J.: An approach to automatically detect and visualize bias in data analytics. In: CEUR Workshop Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, vol. 2572. CEUR (2020)

  13. Lyon, A.: Why are normal distributions normal? Br. J. Philos. Sci. 65(3), 621–649 (2014)

  14. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6), 1–35 (2021)

  15. Panch, T., Mattie, H., Atun, R.: Artificial intelligence and algorithmic bias: implications for health systems. J. Glob. Health 9(2), 010318 (2019)

  16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  17. Poulos, J., Valle, R.: Missing data imputation for supervised learning. Appl. Artif. Intell. 32(2), 186–196 (2018)

  18. Rabanser, S., Günnemann, S., Lipton, Z.: Failing loudly: an empirical study of methods for detecting dataset shift. Adv. Neural Inf. Process. Syst. 32, 1396–1408 (2019)

  19. Rezaei, A., Liu, A., Memarrast, O., Ziebart, B.D.: Robust fairness under covariate shift. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 9419–9427 (2021)

  20. Smith, A.T., Elkan, C.: Making generative classifiers robust to selection bias. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 657–666 (2007)

  21. Stojanov, P., Gong, M., Carbonell, J., Zhang, K.: Low-dimensional density ratio estimation for covariate shift correction. Proc. Mach. Learn. Res. 89, 3449–3458 (2019)

  22. Strack, B., Deshazo, J., Gennings, C., Olmo Ortiz, J.L., Ventura, S., et al.: Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Res. Int. 2014, 781670 (2014)


Author information



Corresponding author

Correspondence to Katharina Dost.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dost, K., Duncanson, H., Ziogas, I., Riddle, P., Wicker, J. (2022). Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias. In: Gama, J., Li, T., Yu, Y., Chen, E., Zheng, Y., Teng, F. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science, vol. 13281. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05935-3

  • Online ISBN: 978-3-031-05936-0

  • eBook Packages: Computer Science (R0)