The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences in language use in search engines have not been thoroughly analyzed, especially with respect to age and gender. Such differences matter because of the growing use of search engine data in the study of human health, where queries are used to identify patient populations. Using three datasets comprising queries from multiple general-purpose Internet search engines, we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines. Our results show that females and younger people use longer queries: females make approximately 25% more queries with 10 or more words. In the case of queries which identify users as having specific medical conditions, we find that females make 53% more queries than expected, and that this results in patient cohorts which are highly skewed in gender and age compared to known patient populations. We show that cohort selection methods which use additional information, beyond queries in which users indicate their condition, yield less skewed cohorts. Finally, we show that biased training cohorts can lead to differential performance of models designed to detect disease from search engine queries. Our results indicate that in studies where demographic representation is important, such as studies of users' health or evaluations of search engine fairness, care should be taken in selecting search engine data so as to create a representative dataset.
Age groups in this dataset are (in years): 13–17, 18–24, 25–34, 35–49, 50–64, 65 and over.
comScore age groups are (in years): 12, 13–14, 15–17, 18–20, 21–24, 25–34, 35–44, 45–49, 50–54, 55–64, 65 and over.
As provided by either Cancer Research UK (www.cancerresearchuk.org), American Cancer Society (cancer.org) or published in the scientific literature.
Yom-Tov, E. Demographic differences in search engine use with implications for cohort selection. Inf Retrieval J 22, 570–580 (2019). https://doi.org/10.1007/s10791-018-09349-2
- Search engines