Diabetes is often diagnosed late. This study aimed to assess whether diabetes could be detected earlier from search engine data, using predictive models trained on large-scale data.
We extracted all English-language queries made by people in the USA to Bing during 1 year and identified queries containing symptoms of diabetes. We compared the ability of four different prediction models (linear regression, logistic regression, decision tree and random forest) to distinguish between users who stated that they were diagnosed with diabetes and users who did not refer to diabetes or diabetes drugs but queried about at least one of the symptoms.
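Comparing models on their ability to distinguish two groups of users needs a common yardstick, and the area under the ROC curve is a usual choice. The following is a minimal sketch of a rank-based AUC, assuming each model outputs one numeric score per user. The function name and the scores are invented for illustration and are not taken from the study.

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    user receives a higher score than a randomly chosen negative user.
    Tied scores count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores from one model (illustration only, not study data).
diagnosed_scores = [0.9, 0.8, 0.35, 0.7]      # users who reported a diagnosis
control_scores = [0.1, 0.4, 0.3, 0.2, 0.6]    # symptom-only users

example_auc = auc(diagnosed_scores, control_scores)
print(f"AUC = {example_auc:.2f}")
```

The pairwise loop is quadratic in the number of users, so it only suits small samples. At the scale of millions of users, a sort-based implementation would be the practical choice.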
We identified 11,050 “new diabetes users” who stated they had been diagnosed with diabetes and approximately 11.5 million “control users” who queried about symptoms without querying for terms related to diabetes. Both the logistic regression and the random forest models were able to distinguish between the two populations with an area under the curve of 0.92, which translates to a positive predictive value of 56% at a false-positive rate of 1%. The models could identify patients up to 240 days before they mentioned being diagnosed.
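In general, a positive predictive value at a fixed false-positive rate also depends on the sensitivity at that operating point and on the proportion of true cases in the evaluated population. The sketch below shows that relationship via Bayes' rule. The sensitivity and prevalence values are hypothetical, since the abstract does not state which values underlie the reported 56%.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """PPV = TP / (TP + FP), expressed as population fractions."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 1% false-positive rate (as in the text),
# with an assumed sensitivity of 50% (not a figure from the study).
assumed_sensitivity = 0.50
fpr = 0.01

# The same classifier yields very different PPVs at different prevalences.
for assumed_prevalence in (0.5, 0.01, 0.001):
    ppv = positive_predictive_value(assumed_sensitivity, fpr, assumed_prevalence)
    print(f"prevalence={assumed_prevalence:.3f} -> PPV={ppv:.3f}")
```

Because prevalence enters the formula directly, the PPV of a screening tool has to be read against the population in which it was evaluated.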
Some undiagnosed diabetes patients can be detected accurately according to their symptom queries to a search engine. Such earlier diagnosis, especially in cases of type 1 diabetes, could be clinically meaningful. The ability of search engines to serve as a population-wide screening tool could potentially be improved using additional data provided by users.
Availability of data and material
Conflict of interest
EYT is an employee of Microsoft, owner of the Bing search engine. Work described herein was performed as part of the author’s salaried employment. The authors declare no conflict of interest.
This study was approved by the Behavioral Sciences Research Ethics Committee of the Technion, Israel Institute of Technology.
For this type of study formal informed consent is not required.
Managed by Antonio Secchi.
Cite this article
Hochberg, I., Daoud, D., Shehadeh, N. et al. Can internet search engine queries be used to diagnose diabetes? Analysis of archival search data. Acta Diabetol 56, 1149–1154 (2019). https://doi.org/10.1007/s00592-019-01350-5
- Digital health