
Evaluation metrics for measuring bias in search engine results

Information Retrieval Journal


Search engines decide what we see for a given search query. Since many people are exposed to information through search engines, it is fair to expect search engines to be neutral. However, search engine results do not necessarily cover all viewpoints of a query topic, and they can be biased towards a specific view: results are ranked by relevance, which is computed from many features by sophisticated algorithms in which search neutrality is not necessarily a focal point. It is therefore important to evaluate search engine results with respect to bias. In this work we propose novel web search bias evaluation measures that take both rank and relevance into account. We also propose a framework for evaluating web search bias using these measures, and we test the framework on two popular search engines over 57 controversial query topics such as abortion, medical marijuana, and gay marriage. We measure stance bias (in support or against) as well as ideological bias (conservative or liberal). We observe that stance does not necessarily correlate with ideological leaning; for example, a positive stance on abortion indicates a liberal leaning, whereas a positive stance on the Cuba embargo indicates a conservative leaning. Our experiments show that neither search engine suffers from stance bias. However, both search engines suffer from ideological bias, each favouring one ideological leaning over the other, which is more significant from the perspective of polarisation in our society.
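The abstract describes measures in which documents at higher ranks contribute more to the bias estimate. The paper's actual measures are not reproduced here, but a generic rank-discounted stance-bias score in the same spirit might look like the following sketch; the logarithmic discount and the normalised aggregation are assumptions chosen purely for illustration.

```python
import math

def rank_discounted_bias(stances):
    """Illustrative sketch only, not the paper's measure.

    stances: per-document stance labels in rank order,
             +1 (in support), -1 (against), or 0 (neutral).
    Returns a score in [-1, 1]; 0 indicates a balanced result list.
    """
    if not stances:
        return 0.0
    # Log-discount weights: rank 1 gets weight 1/log2(2) = 1,
    # deeper ranks contribute progressively less.
    weights = [1.0 / math.log2(rank + 2) for rank in range(len(stances))]
    weighted = sum(w * s for w, s in zip(weights, stances))
    return weighted / sum(weights)

# A result list leaning "in support" near the top yields a positive score:
print(rank_discounted_bias([1, 1, -1, 0, -1]))
```

Under such a scheme, swapping a supporting document from rank 1 with an opposing document at rank 5 changes the score even though the set of documents is unchanged, which is the property a rank-aware bias measure is meant to capture.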




  1. We refer to the notion of relevance defined in the literature as system relevance, or topical relevance, i.e. the relevance predicted by the system.

  2. We refer to the notion of ideology as perceived by the crowd workers.




Acknowledgements

We thank the reviewers for their comments. This work has been funded by the EPSRC Fellowship titled “Task Based Information Retrieval”, grant reference number EP/P024289/1, and the visiting researcher programme of The Alan Turing Institute.

Author information



Corresponding author

Correspondence to Gizem Gezici.

Ethics declarations

Ethical standard

Author Emine Yilmaz previously worked as a research consultant for Microsoft Research and is currently a research consultant for Amazon Research.



Cite this article

Gezici, G., Lipani, A., Saygin, Y. et al. Evaluation metrics for measuring bias in search engine results. Inf Retrieval J 24, 85–113 (2021).
