Demographic differences in search engine use with implications for cohort selection

  • Elad Yom-Tov


The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences in language use in search engines have not been thoroughly analyzed, especially with respect to age and gender. Such differences are important given the growing use of search engine data in the study of human health, where queries are used to identify patient populations. Using three datasets comprising queries from multiple general-purpose Internet search engines, we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines. Our results show that females and younger people use longer queries. In particular, females make approximately 25% more queries with 10 or more words. In the case of queries which identify users as having specific medical conditions, we find that females make 53% more queries than expected, and that this results in patient cohorts which are highly skewed in gender and age compared to known patient populations. We show that methods for cohort selection which use additional information, beyond queries where users indicate their condition, produce less skewed cohorts. Finally, we show that biased training cohorts can lead to differential performance of models designed to detect disease from search engine queries. Our results indicate that in studies where demographic representation is important, such as studies of the health aspects of users or evaluations of search engines for fairness, care should be taken in the selection of search engine data so as to create a representative dataset.
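The "more queries than expected" figures above can be read as a relative excess of an observed share over a reference share. The abstract does not state how the expected value was computed, so the following is only a minimal sketch under that assumption. The function name `relative_excess` and the counts used are hypothetical and are not taken from the study.

```python
def relative_excess(observed_count, total_count, expected_share):
    """Return the relative excess of a group's observed share over its expected share.

    A return value of 0.53 means the group contributes 53% more than expected.
    Assumption: "expected" is the group's share in a reference population.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 < expected_share <= 1:
        raise ValueError("expected_share must be in (0, 1]")
    observed_share = observed_count / total_count
    return observed_share / expected_share - 1.0


# Hypothetical counts chosen only for illustration (not data from the study):
# 765 of 1000 condition-identifying queries come from a group that makes up
# half of the reference population.
excess = relative_excess(observed_count=765, total_count=1000, expected_share=0.5)
print(f"{excess:.0%} more queries than expected")
```

A value of zero would indicate that the cohort matches the reference population for that group, and negative values would indicate under-representation.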


Search engines, Age, Gender, Demographics




Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Microsoft Research Israel, Herzeliya, Israel
  2. Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Haifa, Israel
