Abstract
Computational analyses that apply Artificial Intelligence (AI)/Machine Learning (ML) methods to derive patterns and inferences from big datasets in computational social science (CSS) studies can suffer from biases introduced during the data construction, collection and analysis phases, and can also face challenges of generalizability and ethics. Given the interdisciplinary nature of CSS, the likelihood of bias is influenced by many factors: the need for a comprehensive understanding of the policy and rights landscape, the fast-evolving AI/ML paradigms, and dataset-specific pitfalls. This chapter identifies challenges faced by researchers in the CSS field and presents a taxonomy of biases that may arise in AI/ML approaches. The taxonomy mirrors the stages of a common AI/ML pipeline: dataset construction and collection, data analysis, and evaluation. Drawing on bias detection and mitigation in AI, an active area of research, the chapter highlights practices for incorporating responsible research and innovation into CSS.
Acknowledgements
This research is funded by the UKRI Strategic Priority Fund as part of the wider Protecting Citizens Online programme (Grant number: EP/W032473/1) associated with the National Research Centre on Privacy, Harm Reduction and Adversarial Influence Online (REPHRAIN), and by the Science and Technology Facilities Council (STFC) DiRAC-funded “Understanding the multiple dimensions of prediction of concepts in social and biomedical science questionnaires” project, grant number ST/S003916/1.
Copyright information
© 2023 The Institution of Engineers (India)
Cite this chapter
De, S., Jangra, S., Agarwal, V., Johnson, J., Sastry, N. (2023). Biases and Ethical Considerations for Machine Learning Pipelines in the Computational Social Sciences. In: Mukherjee, A., Kulshrestha, J., Chakraborty, A., Kumar, S. (eds) Ethics in Artificial Intelligence: Bias, Fairness and Beyond. Studies in Computational Intelligence, vol 1123. Springer, Singapore. https://doi.org/10.1007/978-981-99-7184-8_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7183-1
Online ISBN: 978-981-99-7184-8
eBook Packages: Intelligent Technologies and Robotics (R0)