Abstract
Aims
Diabetes is often diagnosed late. This study aimed to assess whether diabetes could be detected earlier from search data, using predictive models trained on large-scale data.
Methods
We extracted all English-language queries made by people in the USA to Bing during 1 year and identified queries containing symptoms of diabetes. We compared the ability of four different prediction models (linear regression, logistic regression, decision tree and random forest) to distinguish between users who stated that they were diagnosed with diabetes and users who did not refer to diabetes or diabetes drugs but queried about at least one of the symptoms.
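The cohort definition described above can be illustrated with a minimal sketch. The term lists and the self-report pattern below are hypothetical placeholders chosen for illustration; the abstract does not list the actual symptom terms, drug names, or diagnosis phrasing used in the study.

```python
import re

# Hypothetical, illustrative term lists (assumptions, not the study's lists).
SYMPTOM_TERMS = {"excessive thirst", "frequent urination", "blurred vision"}
DIABETES_TERMS = {"diabetes", "diabetic", "metformin", "insulin"}

# Hypothetical pattern for a self-reported diagnosis statement.
DIAGNOSIS_PATTERN = re.compile(
    r"\bi (was|have been|am) (just |recently )?diagnosed with diabetes\b"
)


def assign_cohort(queries):
    """Return 'new_diabetes', 'control', or None (excluded) for one user's queries."""
    texts = [q.lower() for q in queries]
    # Users who state that they were diagnosed form the positive cohort.
    if any(DIAGNOSIS_PATTERN.search(t) for t in texts):
        return "new_diabetes"
    mentions_diabetes = any(term in t for t in texts for term in DIABETES_TERMS)
    mentions_symptom = any(term in t for t in texts for term in SYMPTOM_TERMS)
    # Controls queried a symptom but never referred to diabetes or its drugs.
    if mentions_symptom and not mentions_diabetes:
        return "control"
    return None
```

Under these assumptions, a user who asks only about "excessive thirst at night" is a control, while a user who also searches for "insulin price" is excluded from both groups.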
Results
We identified 11,050 “new diabetes users” who stated they had been diagnosed with diabetes and approximately 11.5 million “control users” who queried about symptoms without querying for terms related to diabetes. Both the logistic regression and the random forest models distinguished between the two populations with an area under the curve of 0.92, which translates to a positive predictive value of 56% at a false-positive rate of 1%. The model could identify patients up to 240 days before they mentioned being diagnosed.
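To make the two reported metrics concrete, the following sketch computes an area under the ROC curve using the rank (Mann-Whitney) formulation and the positive predictive value at a capped false-positive rate. The function names and the toy scores are assumptions for illustration only and are not the study's code or data. Note that the positive predictive value at a fixed false-positive rate depends on the ratio of positives to negatives in the evaluated sample, which the abstract does not specify.

```python
def auc(scores, labels):
    """Area under the ROC curve: probability that a random positive
    outranks a random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ppv_at_fpr(scores, labels, max_fpr=0.01):
    """PPV at the lowest score threshold whose false-positive rate
    does not exceed max_fpr (None if no threshold qualifies)."""
    n_neg = sum(1 for y in labels if y == 0)
    best = None
    # Walk thresholds from strictest to most permissive.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / n_neg > max_fpr:
            break
        best = tp / (tp + fp)
    return best


# Toy example (hypothetical scores, 3 positives and 5 negatives).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(auc(scores, labels))              # 14 of 15 positive-negative pairs ranked correctly
print(ppv_at_fpr(scores, labels, 0.2))  # 3 true positives, 1 false positive
```

The pairwise AUC loop is quadratic in sample size and is meant only to show the definition; at the scale of millions of users a sort-based computation would be needed.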
Conclusions
Some undiagnosed diabetes patients can be detected accurately from their symptom queries to a search engine. Such earlier diagnosis, especially in cases of type 1 diabetes, could be clinically meaningful. The ability of search engines to serve as a population-wide screening tool could potentially be improved using additional data provided by users.
Availability of data and material
The datasets analyzed during the current study are not publicly available due to privacy and terms of use but are available from the corresponding author on reasonable request.
Author information
Contributions
IH conceived the study, interpreted the results, and wrote the paper. EYT conceived the study, collected the data, analyzed it and wrote the paper. DD and SN assisted in analyzing the results and wrote the paper.
Ethics declarations
Conflict of interest
EYT is an employee of Microsoft, owner of the Bing search engine. Work described herein was performed as part of the author’s salaried employment. The authors declare no conflict of interest.
Ethical approval
This study was approved by the Behavioral Sciences Research Ethics Committee of the Technion, Israel Institute of Technology.
Informed consent
For this type of study formal informed consent is not required.
Additional information
Managed by Antonio Secchi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Hochberg, I., Daoud, D., Shehadeh, N. et al. Can internet search engine queries be used to diagnose diabetes? Analysis of archival search data. Acta Diabetol 56, 1149–1154 (2019). https://doi.org/10.1007/s00592-019-01350-5