Journal of Digital Imaging, Volume 28, Issue 5, pp 537–546

Analyzing Medical Image Search Behavior: Semantics and Prediction of Query Results

  • Maria De-Arteaga
  • Ivan Eggel
  • Charles E. Kahn Jr.
  • Henning Müller


Log files of information retrieval systems that record user behavior have been used to improve the outcomes of retrieval systems, understand user behavior, and predict events. In this article, a log file of the ARRS GoldMiner search engine containing 222,005 consecutive queries is analyzed. Time stamps are available for each query, as well as masked IP addresses, which makes it possible to identify queries from the same person. This article describes the ways in which physicians (or Internet searchers interested in medical images) search and proposes potential improvements by suggesting query modifications. For example, many queries contain only a few terms and are therefore not specific; others contain spelling mistakes or non-medical terms that likely lead to poor or empty results. One of the goals of this report is to predict the number of results a query will have, since such a model allows search engines to automatically propose query modifications in order to avoid result lists that are empty or too large. This prediction is based on characteristics of the query terms themselves. Prediction of empty results has an accuracy above 88 % and can thus be used to automatically modify the query to avoid empty result sets for a user. The semantic analysis and data on reformulations done by users in the past can aid the development of better search systems, particularly to improve results for novice users. This paper therefore offers ideas for better understanding how people search and for using this knowledge to improve the performance of specialized medical search engines.
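As a rough illustration of the two ideas in the abstract (identifying queries from the same person via masked IP addresses and time stamps, and describing a query by characteristics of its terms), the following is a minimal sketch and not the authors' method. The record layout, the 30-minute inactivity gap, the function names, and the three features are all assumptions made for the example; the abstract does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (masked_ip, ISO time stamp, query text).
LOG = [
    ("ip_a", "2013-01-01T10:00:00", "pneumothorax"),
    ("ip_b", "2013-01-01T10:01:00", "knee mri"),
    ("ip_a", "2013-01-01T10:05:00", "tension pneumothorax ct"),
    ("ip_a", "2013-01-01T12:00:00", "liver"),
]

# Assumed inactivity threshold; the abstract does not state one.
SESSION_GAP = timedelta(minutes=30)


def group_sessions(records, gap=SESSION_GAP):
    """Group queries by masked IP, starting a new session when the pause exceeds `gap`."""
    by_ip = defaultdict(list)
    for ip, stamp, query in records:
        by_ip[ip].append((datetime.fromisoformat(stamp), query))

    sessions = []
    for ip, entries in by_ip.items():
        entries.sort(key=lambda e: e[0])  # order each user's queries by time
        current = [entries[0][1]]
        for (prev_time, _), (time, query) in zip(entries, entries[1:]):
            if time - prev_time > gap:
                sessions.append((ip, current))
                current = []
            current.append(query)
        sessions.append((ip, current))
    return sessions


def query_features(query, vocabulary):
    """Term-based features of a query (illustrative only, not the paper's feature set)."""
    terms = query.lower().split()
    known = sum(1 for t in terms if t in vocabulary)
    return {
        "n_terms": len(terms),
        "n_chars": len(query),
        "frac_known_terms": known / len(terms) if terms else 0.0,
    }
```

Features of this kind could be passed to any standard classifier to predict whether a query returns an empty result list; which features and classifiers were actually used is described in the full article, not in the abstract.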


Image retrieval; Human-computer interaction; Machine learning; Statistic analysis; Information storage and retrieval; Medical image search; Log file analysis



Copyright information

© Society for Imaging Informatics in Medicine 2015

Authors and Affiliations

  • Maria De-Arteaga (1)
  • Ivan Eggel (2)
  • Charles E. Kahn Jr. (3)
  • Henning Müller (2)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. HES-SO, Sierre, Switzerland
  3. University of Pennsylvania, Philadelphia, USA
