Payoffs and pitfalls in using knowledge-bases for consumer health search

Abstract

Consumer health search (CHS) is a challenging domain: vocabulary mismatch and the considerable domain expertise required hamper people's ability to formulate effective queries. We posit that using knowledge bases for query reformulation may help alleviate this problem. How to exploit knowledge bases for effective CHS is, however, nontrivial, involving a swathe of key choices and design decisions (many of which are not explored in the literature). Here we rigorously and empirically evaluate the impact these different choices have on retrieval effectiveness. A state-of-the-art knowledge-base retrieval model, the Entity Query Feature Expansion model, was used to evaluate these choices, which include: which knowledge base to use (specialised vs. general purpose), how to construct the knowledge base, how to extract entities from queries and map them to entities in the knowledge base, which part of the knowledge base to use for query expansion, and whether to augment the knowledge-base search process with relevance feedback. While knowledge-base retrieval has been proposed as a solution for CHS, this paper delves into the finer details of doing this effectively, highlighting both payoffs and pitfalls. It aims to provide lessons to others advancing the state of the art in CHS.
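As an illustration of the kind of pipeline the paper evaluates (not the authors' implementation), the sketch below shows the general shape of knowledge-base query expansion: extract entity mentions from the query, map them to KB entities, and append terms from a chosen KB field. The toy KB, its field names, and the example query are all hypothetical.

```python
# Illustrative sketch only: a KB-driven query expansion pipeline of the kind
# evaluated in the paper, not the authors' implementation. The toy KB, its
# field names, and the example query are hypothetical.
from typing import Dict, List

# Toy knowledge base: entity name -> fields usable as sources of expansion.
KB: Dict[str, Dict[str, List[str]]] = {
    "insomnia": {
        "aliases": ["sleeplessness"],
        "categories": ["sleep disorder"],
    },
}

def extract_mentions(query: str) -> List[str]:
    """Mention extraction: keep query n-grams that exactly match a KB
    entity name (complete string matches only, as in footnote 8)."""
    tokens = query.lower().split()
    ngrams = {" ".join(tokens[i:j])
              for i in range(len(tokens))
              for j in range(i + 1, len(tokens) + 1)}
    return [g for g in ngrams if g in KB]

def expand_query(query: str, source_field: str = "aliases") -> str:
    """Expansion: append terms from the chosen KB field (the 'source of
    expansion' choice) for every entity mentioned in the query."""
    expansions: List[str] = []
    for mention in extract_mentions(query):
        expansions.extend(KB[mention].get(source_field, []))
    return f"{query} {' '.join(expansions)}" if expansions else query

print(expand_query("insomnia will not go away"))
# -> "insomnia will not go away sleeplessness"
```

Which field to draw expansions from (aliases, categories, infobox content, etc.) corresponds to the "source of expansion" choice studied in the paper.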


Notes

  1. The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences.

  2. http://conceptnet.io/c/en/insomnia. Last visited 30/04/2018.

  3. https://sleepfoundation.org/insomnia/content/what-causes-insomnia. Last visited 30/04/2018.

  4. A Wikipedia Infobox is used to summarise important aspects of an entity and its relation with other articles.

  5. http://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes#Health_and_fitness.

  6. A Wikipedia Infobox is used to summarise important aspects of an entity and its relation with other articles.

  7. http://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes#Health_and_fitness.

  8. Only complete string matches were considered.

  9. ECNU-2 achieved the highest effectiveness, but it used the Google query suggestion service to obtain expansions.

References

  • Aronson, A. R., & Lang, F. M. (2010). An overview of metamap: Historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3), 229–236.

  • Balaneshinkordan, S., & Kotov, A. (2016). An empirical comparison of term association and knowledge graphs for query expansion. In European conference on information retrieval (pp. 761–767). Berlin: Springer.

  • Bendersky, M., Metzler, D., & Croft, W. (2012). Effective query formulation with multiple information sources. In Proceedings of the 5th ACM international conference on web search and data mining (pp. 443–452).

  • Bodenreider, O. (2004). The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1), D267–D270.

  • Dalton, J., Dietz, L., & Allan, J. (2014). Entity query feature expansion using knowledge base links. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval (pp. 365–374).

  • Díaz-Galiano, M., Martín-Valdivia, M., & Ureña-López, L. (2009). Query expansion with a medical ontology to improve a multimodal information retrieval system. Journal of Computers in Biology and Medicine, 39(4), 396–403.

  • Egozi, O., Markovitch, S., & Gabrilovich, E. (2011). Concept-based information retrieval using explicit semantic analysis. ACM Transactions on Information Systems (TOIS), 29(2), 8.

  • Fox, S., & Duggan, M. (2013). Health online 2013. Technical report. http://www.pewinternet.org/2013/01/15/health-online-2013/. Accessed 30 Oct 2018.

  • Jimmy, Zuccon, G., & Koopman, B. (2016). Boosting titles does not generally improve retrieval effectiveness. In Proceedings of the 21st Australasian document computing symposium (pp. 25–32).

  • Jimmy, Zuccon, G., & Koopman, B. (2017). QUT ielab at CLEF 2017 e-health IR task: Knowledge base retrieval for consumer health search. In CLEF.

  • Jimmy, Zuccon, G., & Koopman, B. (2018). Choices in knowledge-base retrieval for consumer health search. In Proceedings of the 40th European conference on information retrieval. Berlin: Springer.

  • Keselman, A., Smith, C. A., Divita, G., Kim, H., Browne, A. C., Leroy, G., et al. (2008). Consumer health concepts that do not map to the UMLS: Where do they fit? Journal of the American Medical Informatics Association, 15(4), 496–505.

  • Keselman, A., Tse, T., Crowell, J., Browne, A., Ngo, L., & Zeng, Q. (2006). Relating consumer knowledge of health terms and health concepts. In Proceedings of American medical informatics association.

  • Koopman, B., Zuccon, G., Bruza, P., Sitbon, L., & Lawley, M. (2012). Graph-based concept weighting for medical information retrieval. In Proceedings of the 17th Australasian document computing symposium (pp. 80–87).

  • Kotov, A., & Zhai, C. (2012). Tapping into knowledge base for concept feedback: Leveraging ConceptNet to improve search results for difficult queries. In Proceedings of the 5th ACM international conference on web search and data mining, ACM (pp. 403–412).

  • Limsopatham, N., Macdonald, C., & Ounis, I. (2013). Inferring conceptual relationships to improve medical records search. In Proceedings of the 10th conference on open research areas in information retrieval (pp. 1–8).

  • Liu, X., & Fang, H. (2015). Latent entity space: A novel retrieval approach for entity-bearing queries. Information Retrieval Journal, 18(6), 473–503.

  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.

  • McDaid, D., & Park, A. L. (2011). Online health: Untangling the web. Technical report. https://www.bupa.com.au/staticfiles/Bupa/HealthAndWellness/MediaFiles/PDF/LSE_Report_Online_Health.pdf. Accessed 30 Oct 2018.

  • Palotti, J., Goeuriot, L., Zuccon, G., & Hanbury, A. (2016). Ranking health web pages with relevance and understandability. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval (pp. 965–968).

  • Palotti, J., Zuccon, G., Jimmy, Pecina, P., Lupu, M., Goeuriot, L., Kelly, L., & Hanbury, A. (2017). CLEF 2017 task overview: The IR task at the eHealth evaluation lab. In Working notes of the Conference and Labs of the Evaluation Forum (CLEF). CEUR workshop proceedings.

  • Plovnick, R., & Zeng, Q. (2004). Reformulation of consumer health queries with professional terminology: A pilot study. Journal of Medical Internet Research, 6(3), e27.

  • Sakai, T. (2007). Alternatives to bpref. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’07 (pp. 71–78). New York: ACM.

  • Silva, R., & Lopes, C. (2016). The effectiveness of query expansion when searching for health related content: InfoLab at CLEF eHealth 2016. In CLEF (working notes).

  • Soldaini, L., Cohan, A., Yates, A., Goharian, N., & Frieder, O. (2015). Retrieving medical literature for clinical decision support. In European conference on information retrieval (pp. 538–549). Berlin: Springer.

  • Soldaini, L., & Goharian, N. (2016). QuickUMLS: A fast, unsupervised approach for medical concept extraction. In SIGIR MedIR workshop, Pisa, Italy.

  • Soldaini, L., & Goharian, N. (2017). Learning to rank for consumer health search: A semantic approach. In European conference on information retrieval (pp. 640–646). Berlin: Springer.

  • Soldaini, L., Yates, A., Yom-Tov, E., Frieder, O., & Goharian, N. (2016). Enhancing web search in the medical domain via query clarification. Information Retrieval Journal, 19(1–2), 149–173.

  • Stanton, I., Ieong, S., & Mishra, N. (2014). Circumlocution in diagnostic medical queries. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM (pp. 133–142).

  • Toms, E., & Latter, C. (2007). How consumers search for health information. Health Informatics Journal, 13(3), 223–235.

  • Xiong, C., & Callan, J. (2015). Query expansion with Freebase. In Proceedings of the 2015 international conference on the theory of information retrieval, ACM (pp. 111–120).

  • Zeng, Q., Kogan, S., Ash, N., Greenes, R., & Boxwala, A. (2002). Characteristics of consumer terminology for health information retrieval. Methods of Information in Medicine, 41(4), 289–298.

  • Zeng, Q. T., Crowell, J., Plovnick, R. M., Kim, E., Ngo, L., & Dibble, E. (2006). Assisting consumer health information retrieval with query recommendations. Journal of the American Medical Informatics Association, 13(1), 80–90.

  • Zeng, Q. T., & Tse, T. (2006). Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13(1), 24–29.

  • Zhang, Y. (2014). Searching for specific health-related information in MedlinePlus: Behavioral patterns and user experience. Journal of the Association for Information Science and Technology, 65(1), 53–68.

  • Zuccon, G., Koopman, B., Nguyen, A., Vickers, D., & Butt, L. (2012). Exploiting medical hierarchies for concept-based information retrieval. In Proceedings of the 17th Australasian document computing symposium (pp. 111–114).

  • Zuccon, G., Koopman, B., & Palotti, J. (2015). Diagnose this if you can: On the effectiveness of search engines in finding medical self-diagnosis information. In European conference on information retrieval MedIR’15 (pp. 562–567).

  • Zuccon, G., Palotti, J., Goeuriot, L., Kelly, L., Lupu, M., Pecina, P., Mueller, H., Budaher, J., & Deacon, A. (2016). The IR task at the CLEF eHealth evaluation lab 2016: User-centred health information retrieval. In CLEF 2016: Conference and Labs of the Evaluation Forum.

Acknowledgements

Jimmy is sponsored by the Indonesia Endowment Fund for Education (Lembaga Pengelola Dana Pendidikan/LPDP). Guido Zuccon is the recipient of an Australian Research Council DECRA Research Fellowship (DE180101579) and a Google Faculty Research Award.

Author information

Corresponding author

Correspondence to Guido Zuccon.

Appendices

Appendix 1: Statistical significance analysis

See Tables 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 and 26.

Table 16 Statistical significance analysis for results in Table 3: Choice 1. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 17 Statistical significance analysis for results in Table 4: Choice 2. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 18 Statistical significance analysis for results in Table 5 (top): Choice 3, all queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 19 Statistical significance analysis for results in Table 5 (bottom): Choice 3, high coverage queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 20 Statistical significance analysis for results in Table 6: Choice 4. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 21 Statistical significance analysis for results in Table 7 (top): Choice 5, all queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 22 Statistical significance analysis for results in Table 7 (bottom): Choice 5, high coverage queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 23 Statistical significance analysis for results for CLEF 2015 obtained using the best settings on CLEF 2016 in Table 12. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 24 Statistical significance analysis for results of the CLEF 2016 best settings using CLEF 2016–2017 validation data in Table 13 (top): all queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 25 Statistical significance analysis for results of the CLEF 2016 best settings using CLEF 2016–2017 validation data in Table 13 (bottom): high coverage queries set. n, b, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for nDCG@10, bpref, and RBP@10, respectively
Table 26 Statistical significance analysis for results using condensed evaluation in Table 14. p, m, n, and r mark statistically significant differences (pairwise t-test with Bonferroni correction, \({p} < 0.05\)) for P@10, MAP, nDCG@10, and RBP@10, respectively
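The tables above all rely on pairwise t-tests over per-query scores with a Bonferroni-corrected threshold. As a minimal sketch of how such an analysis can be computed (not the authors' code; the run names and scores below are made up), using SciPy:

```python
# Sketch of the pairwise significance analysis used in Tables 16-26:
# paired t-tests over per-query scores for every pair of runs, with a
# Bonferroni-corrected significance threshold. All values are made up.
from itertools import combinations
from scipy import stats

# Per-query nDCG@10 scores for each run (hypothetical values).
runs = {
    "baseline": [0.21, 0.35, 0.10, 0.44, 0.28],
    "kb_wiki":  [0.30, 0.41, 0.12, 0.50, 0.33],
    "kb_umls":  [0.25, 0.37, 0.09, 0.47, 0.30],
}

pairs = list(combinations(runs, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction over all pairwise tests

for a, b in pairs:
    t, p = stats.ttest_rel(runs[a], runs[b])  # paired (per-query) t-test
    flag = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: t={t:.3f}, p={p:.4f} -> {flag}")
```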

Appendix 2: List of abbreviations

 

Abbreviation  Definition

General
  CHS     Consumer health search
  CHV     Consumer health vocabulary
  EQFE    Entity query feature expansion
  HT      Health term
  IR      Information retrieval
  KB      Knowledge base

Methods
  CC      CHV construction
  CEM     CHV entity mapping
  CME     CHV mention extraction
  CSE     CHV source of expansion
  EM      Entity mapping
  ME      Mention extraction
  PRF     Pseudo relevance feedback
  PRFHT   Pseudo relevance feedback health term
  RF      Relevance feedback
  RFHT    Relevance feedback health term
  SE      Source of expansion
  UC      UMLS construction
  UEM     UMLS entity mapping
  UME     UMLS mention extraction
  UMLS    Unified Medical Language System
  USE     UMLS source of expansion
  WC      Wikipedia construction
  WEM     Wikipedia entity mapping
  WME     Wikipedia mention extraction
  WSE     Wikipedia source of expansion

Measures
  <e, g, l>             <Number of expanded queries, queries with gain, queries with loss>
  \(\overline{|exp|}\)  Average number of terms added in the expanded query
  bpref                 Binary preference
  MAP                   Mean average precision
  nDCG@10               Normalised discounted cumulative gain at rank 10
  P@10                  Precision at rank 10
  RBP@10                Rank-biased precision at rank 10
  Res.                  Residual of the rank-biased precision
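For reference, the standard (textbook) definition of rank-biased precision at cutoff d (here d = 10), with persistence parameter p and binary relevance \(r_i\) at rank i, is given below; this is not reproduced from the paper. The residual (Res.) is the maximum additional score that unjudged documents could still contribute.

```latex
% Standard rank-biased precision at cutoff d (d = 10 in this paper),
% with persistence parameter p and binary relevance r_i at rank i.
\mathrm{RBP@}d = (1 - p) \sum_{i=1}^{d} r_i \, p^{\,i-1}
```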

About this article

Cite this article

Jimmy, Zuccon, G. & Koopman, B. Payoffs and pitfalls in using knowledge-bases for consumer health search. Inf Retrieval J 22, 350–394 (2019). https://doi.org/10.1007/s10791-018-9344-z
