Analyzing Medical Image Search Behavior: Semantics and Prediction of Query Results


Log files of information retrieval systems that record user behavior have been used to improve the outcomes of retrieval systems, understand user behavior, and predict events. In this article, a log file of the ARRS GoldMiner search engine containing 222,005 consecutive queries is analyzed. Time stamps are available for each query, as well as masked IP addresses, which makes it possible to identify queries from the same person. This article describes the ways in which physicians (or Internet searchers interested in medical images) search and proposes potential improvements by suggesting query modifications. For example, many queries contain only a few terms and are therefore not specific; others contain spelling mistakes or non-medical terms that likely lead to poor or empty results. One of the goals of this report is to predict the number of results a query will return, since such a model allows search engines to automatically propose query modifications in order to avoid result lists that are empty or too large. This prediction is based on characteristics of the query terms themselves. Prediction of empty results has an accuracy above 88%, and thus can be used to automatically modify a query to avoid an empty result set for the user. The semantic analysis and the data on query reformulations made by users in the past can aid the development of better search systems, particularly to improve results for novice users. This paper therefore offers ideas for better understanding how people search and how to use this knowledge to improve the performance of specialized medical search engines.
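As a rough illustration of the two ideas above (identifying queries from the same person, and describing a query by characteristics of its terms), the following minimal Python sketch may help. It is not the authors' implementation: the log record layout, the 30-minute inactivity threshold, the function names `split_sessions` and `query_features`, and the three features shown are all assumptions made for the example, and the vocabulary passed in is a hypothetical stand-in for a medical term list.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (masked_ip, timestamp, query).
# The real GoldMiner log format is not described in the text above.
LOG = [
    ("ip_a", "2013-01-01 10:00:00", "pneumothorax"),
    ("ip_a", "2013-01-01 10:02:00", "tension pneumothorax ct"),
    ("ip_b", "2013-01-01 10:03:00", "pnuemonia"),
    ("ip_a", "2013-01-01 12:30:00", "liver"),
]

# Assumed inactivity threshold; chosen only for illustration.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(records, gap=SESSION_GAP):
    """Group queries by masked IP and start a new session after a long pause."""
    by_ip = defaultdict(list)
    for ip, ts, query in records:
        by_ip[ip].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), query))

    sessions = []
    for ip, entries in by_ip.items():
        entries.sort(key=lambda e: e[0])
        current = [entries[0][1]]
        # Compare each query's time stamp with the previous one from the same IP.
        for (prev_ts, _), (ts, query) in zip(entries, entries[1:]):
            if ts - prev_ts > gap:
                sessions.append((ip, current))
                current = []
            current.append(query)
        sessions.append((ip, current))
    return sessions


def query_features(query, vocabulary):
    """Compute a few simple, assumed features from the query terms alone."""
    terms = query.lower().split()
    known = sum(1 for t in terms if t in vocabulary)
    return {
        "n_terms": len(terms),
        "n_chars": len(query),
        # Share of terms found in the (hypothetical) medical vocabulary.
        "known_fraction": known / len(terms) if terms else 0.0,
    }


if __name__ == "__main__":
    for ip, queries in split_sessions(LOG):
        print(ip, queries)
    print(query_features("tension pneumothorax ct", {"pneumothorax", "ct"}))
```

Feature dictionaries of this kind could be passed to any standard classifier to estimate whether a query is likely to return an empty result list; which features and classifier work best is an empirical question that this sketch does not address.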




Abbreviations: CF: clinical findings, O: object, AE: anatomical entity, NS: non-anatomical substance, RD: RadLex descriptor, PP: property, P: procedure, PS: procedure step, IO: imaging observation, IM: imaging modality, RC: report component, R: report, PC: process.



Author information



Corresponding author

Correspondence to Maria De-Arteaga.


Cite this article

De-Arteaga, M., Eggel, I., Kahn, C.E. et al. Analyzing Medical Image Search Behavior: Semantics and Prediction of Query Results. J Digit Imaging 28, 537–546 (2015).



  • Image retrieval
  • Human-computer interaction
  • Machine learning
  • Statistical analysis
  • Information storage and retrieval
  • Medical image search
  • Log file analysis