Can internet search engine queries be used to diagnose diabetes? Analysis of archival search data



Diabetes is often diagnosed late. This study aimed to assess whether diabetes can be detected earlier from search engine queries, using predictive models trained on large-scale data.


We extracted all English-language queries made by people in the USA to Bing during 1 year and identified queries containing symptoms of diabetes. We compared the ability of four different prediction models (linear regression, logistic regression, decision tree and random forest) to distinguish between users who stated that they were diagnosed with diabetes and users who did not refer to diabetes or diabetes drugs but queried about at least one of the symptoms.
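The cohort definition above can be sketched as a simple labeling rule. This is only an illustration under stated assumptions: the term lists, the diagnosis pattern, and the function name `label_user` are hypothetical and are not the lists or rules used in the study.

```python
import re

# Hypothetical term lists, for illustration only; the study's actual
# symptom and drug lists are not given in this text.
SYMPTOM_TERMS = ["excessive thirst", "frequent urination", "blurred vision"]
DIABETES_TERMS = ["diabetes", "diabetic", "insulin", "metformin"]

# Hypothetical pattern for a self-reported diagnosis.
DIAGNOSIS_PATTERN = re.compile(
    r"\bi (was|have been|got) (just |recently )?diagnosed with diabetes\b"
)


def label_user(queries):
    """Return 'new_diabetes', 'control', or None for one user's queries."""
    lowered = [q.lower() for q in queries]
    # Users who state that they were diagnosed form the positive cohort.
    if any(DIAGNOSIS_PATTERN.search(q) for q in lowered):
        return "new_diabetes"
    mentions_diabetes = any(t in q for q in lowered for t in DIABETES_TERMS)
    mentions_symptom = any(t in q for q in lowered for t in SYMPTOM_TERMS)
    # Controls asked about a symptom but never referred to diabetes or its drugs.
    if mentions_symptom and not mentions_diabetes:
        return "control"
    # Everyone else is excluded from the comparison.
    return None
```

For example, `label_user(["excessive thirst at night"])` falls into the control cohort, while a user who searched for both a symptom and "diabetes" without stating a diagnosis is excluded.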


We identified 11,050 “new diabetes users” who stated they had been diagnosed with diabetes and approximately 11.5 million “control users” who queried about symptoms without querying for terms related to diabetes. Both the logistic regression and the random forest models distinguished between the two populations with an area under the curve (AUC) of 0.92, which translates to a positive predictive value of 56% at a false-positive rate of 1%. The model could identify patients up to 240 days before they mentioned being diagnosed.
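The two metrics reported above can be illustrated with a minimal sketch. The AUC is the probability that a randomly chosen positive user receives a higher score than a randomly chosen control, and the positive predictive value (PPV) follows from the true-positive rate, the false-positive rate, and the share of positives among the users being scored. The function names and all input numbers below are hypothetical; the text above does not state the sensitivity or the class ratio behind the 56% figure, so this sketch does not attempt to reproduce it.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def ppv(tpr, fpr, prevalence):
    """Positive predictive value from Bayes' rule.

    tpr: true-positive rate (sensitivity) at the chosen threshold
    fpr: false-positive rate at the same threshold
    prevalence: fraction of positives among the users being scored
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical model scores for three positives and three controls.
    print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
    # Hypothetical operating point: 50% sensitivity at a 1% false-positive rate.
    print(ppv(tpr=0.5, fpr=0.01, prevalence=0.01))
```

The pairwise AUC is quadratic in the number of users and is meant only to show the definition; a rank-based computation would be needed at the scale of millions of users. The `ppv` function also makes explicit that, at a fixed false-positive rate, the PPV falls as the positive class becomes rarer.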


Some undiagnosed diabetes patients can be detected accurately from their symptom queries to a search engine. Such earlier diagnosis, especially in cases of type 1 diabetes, could be clinically meaningful. The ability of search engines to serve as a population-wide screening tool could potentially be improved using additional data provided by users.


Availability of data and material

The datasets analyzed during the current study are not publicly available due to privacy and terms of use but are available from the corresponding author on reasonable request.



Author information




IH conceived the study, interpreted the results, and wrote the paper. EYT conceived the study, collected the data, analyzed it and wrote the paper. DD and SN assisted in analyzing the results and wrote the paper.

Corresponding author

Correspondence to Irit Hochberg.

Ethics declarations

Conflict of interest

EYT is an employee of Microsoft, owner of the Bing search engine. Work described herein was performed as part of the author’s salaried employment. The authors declare no conflict of interest.

Ethical approval

This study was approved by the Behavioral Sciences Research Ethics Committee of the Technion, Israel Institute of Technology.

Informed consent

For this type of study formal informed consent is not required.

Additional information


Managed by Antonio Secchi.

Electronic supplementary material

Supplementary material 1 (DOCX 15 kb)


About this article


Cite this article

Hochberg, I., Daoud, D., Shehadeh, N. et al. Can internet search engine queries be used to diagnose diabetes? Analysis of archival search data. Acta Diabetol 56, 1149–1154 (2019).



  • Diabetes
  • Symptoms
  • Digital health
  • Internet