Biases and Ethical Considerations for Machine Learning Pipelines in the Computational Social Sciences

  • Chapter
  • First Online:
Ethics in Artificial Intelligence: Bias, Fairness and Beyond

Part of the book series: Studies in Computational Intelligence (SCI, volume 1123)

Abstract

In computational social science (CSS) studies, computational analyses that use Artificial Intelligence (AI)/Machine Learning (ML) methods to generate patterns and inferences from big datasets can suffer from biases introduced during the data construction, collection and analysis phases, and can also face challenges of generalizability and ethics. Given the interdisciplinary nature of CSS, several factors influence the likelihood that biases are introduced, including the need for a comprehensive understanding of the policy and rights landscape, fast-evolving AI/ML paradigms and dataset-specific pitfalls. This chapter identifies challenges faced by researchers in the CSS field and presents a taxonomy of biases that may arise in AI/ML approaches. The taxonomy mirrors the stages of common AI/ML pipelines: dataset construction and collection, data analysis, and evaluation. Drawing on the active research area of detecting and mitigating bias in AI, the chapter highlights practices for incorporating responsible research and innovation into CSS.
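
As a purely illustrative aside (not code from the chapter), the minimal Python sketch below shows one kind of evaluation-stage bias check that such a pipeline might include: comparing a classifier's positive-prediction rates across groups defined by a protected attribute. The choice of metrics (demographic parity difference and disparate impact ratio), the column names and the synthetic data are all assumptions made for the example.

```python
# Minimal sketch of an evaluation-stage bias audit (illustrative only;
# the data, column names and groups are hypothetical).
import pandas as pd

# Hypothetical classifier predictions alongside a protected attribute.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "predicted": [1,   0,   1,   1,   0,   0,   1,   0],
})

# Positive-prediction (selection) rate per group.
rates = df.groupby("group")["predicted"].mean()

# Demographic parity difference: gap between the highest and lowest rates.
dp_difference = rates.max() - rates.min()

# Disparate impact ratio: lowest rate over highest rate; values far
# below 1.0 flag a disparity worth investigating before deployment.
di_ratio = rates.min() / rates.max()

print(rates)
print(f"Demographic parity difference: {dp_difference:.2f}")
print(f"Disparate impact ratio: {di_ratio:.2f}")
```

Whether such a disparity reflects bias introduced at the dataset construction, collection, analysis or evaluation stage, and which mitigation is appropriate, is the kind of question the chapter's taxonomy is intended to help researchers reason about.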


Notes

  1. https://eugdpr.org/.

  2. https://facctconference.org.


Acknowledgements

This research is funded by the UKRI Strategic Priority Fund as part of the wider Protecting Citizens Online programme (Grant number: EP/W032473/1) associated with the National Research Centre on Privacy, Harm Reduction and Adversarial Influence Online (REPHRAIN), and by the Science and Technology Facilities Council (STFC) DiRAC-funded “Understanding the multiple dimensions of prediction of concepts in social and biomedical science questionnaires” project, grant number ST/S003916/1.

Author information

Corresponding author

Correspondence to Suparna De.

Copyright information

© 2023 The Institution of Engineers (India)

About this chapter


Cite this chapter

De, S., Jangra, S., Agarwal, V., Johnson, J., Sastry, N. (2023). Biases and Ethical Considerations for Machine Learning Pipelines in the Computational Social Sciences. In: Mukherjee, A., Kulshrestha, J., Chakraborty, A., Kumar, S. (eds) Ethics in Artificial Intelligence: Bias, Fairness and Beyond. Studies in Computational Intelligence, vol 1123. Springer, Singapore. https://doi.org/10.1007/978-981-99-7184-8_6
