1 Introduction

A major challenge for users in consumer health search (CHS) is how to effectively represent complex and ambiguous information needs as a query (Zhang 2014; Toms and Latter 2007; Zeng et al. 2002). Studies on query formulation in CHS have shown that consumers struggle to find effective query terms (Zeng et al. 2002), often submitting layman and circumlocutory descriptions of symptoms instead of precise medical terms (Stanton et al. 2014; Zuccon et al. 2015). For example, people search for “skin irregularities” instead of “skin lesions” (the correct medical term for the symptom). They do so using general web search engines, which are commonly preferred over specialised health web sites and services (Fox and Duggan 2013; McDaid and Park 2011). However, previous work has shown that the use of general web search engines for answering these specific health needs leads to poor retrieval effectiveness, incorrect information and possibly low user satisfaction (Zuccon et al. 2015). Different approaches have been proposed to improve CHS, including query suggestion (Zeng et al. 2006), learning-to-rank using syntactic, semantic or readability features (Soldaini and Goharian 2017; Palotti et al. 2016), and query expansion or reformulation (Soldaini et al. 2016; Silva and Lopes 2016; Plovnick and Zeng 2004).

Here we focus on overcoming the problems in CHS by expanding a health query with more effective terms (e.g., less ambiguous terms or medical synonyms). For example, the query “skin tag” can be expanded by adding the term “acrochordon”, the medical term for skin tag. The term “acrochordon” provides better disambiguation as it effectively represents the original two-term query. Documents containing the term “acrochordon” are more likely to be relevant to the query than documents containing either “skin” or “tag” alone.

A valuable source of medical domain knowledge is contained in carefully curated medical knowledge bases (KBs); for example, the UMLS medical thesaurus.Footnote 1 Manually replacing query terms with those from medical knowledge bases has proven effective (Plovnick and Zeng 2004)—but can it be done automatically?

Effectively utilising a KB to improve retrieval involves a large number of important design decisions. The impact of these different decisions has not been thoroughly and rigorously considered in most previous approaches (Bendersky et al. 2012; Dalton et al. 2014). Thus, in this paper, we also seek to empirically evaluate the impact of a number of different choices in KB retrieval.


Key contributions

  • The implementation and evaluation of a state-of-the-art knowledge base retrieval method for consumer health search;

  • The impact of implementation choices, including: (i) KB construction; (ii) entity mention extraction; (iii) entity mapping; (iv) source of expansion; (v) use of relevance feedback. We also determine whether a specialised KB is preferable to a general purpose one, or vice versa.

While some of this material is covered in an existing study (Jimmy et al. 2018), this article includes the following additional contributions:

  • An extended literature review highlighting key works that have proposed methods to exploit knowledge bases and knowledge graphs for query expansion, both within and outside health search.

  • An expanded explanation of the methods by integrating a meaningful example that aids the understanding of the key differences produced by each considered choice in the KB query expansion process.

  • The addition of the Consumer Health Vocabulary (CHV) as another knowledge base (Choice 1). CHV provides a mapping between professional medical lingo and consumer expressions (Zeng and Tse 2006; Keselman et al. 2008).

  • The extraction of query entity mentions (Choice 2) using MetaMap (Aronson and Lang 2010), a biomedical information extraction system.

  • A study of combining expansion terms from all KBs (Wikipedia, UMLS and CHV) when considering the source of expansion for term selection (Choice 4).

  • An evaluation of an alternative approach for relevance feedback and pseudo relevance feedback (Choice 5) based on Soldaini et al. (2015)’s work, which filters expansion terms based on their likelihood of being health related.

  • An investigation of the generalisability of the results via evaluation on an additional test collection, CLEF eHealth 2015, which uses different queries and a different web crawl.

  • An analysis of the influence of unjudged documents on retrieval results, including evaluation using the combined relevance assessments from CLEF 2016 and CLEF 2017, and using the condensed list approach (Sakai 2007).

The remainder of this paper is structured as follows. Section 2 discusses previous work related to this article. Section 3 describes the query expansion model used and the choices we consider for knowledge base retrieval. Section 4 explains the data collection used in this work. Section 5 details the empirical evaluation performed and the evaluation results. Section 6 analyses and discusses the evaluation results, while Sect. 7 concludes this article. Additionally, “Appendix 1: Statistical significance analysis” section reports the statistical significance analysis for all the results of the experiments discussed in this article, and “Appendix 2: List of abbreviations” section lists the abbreviations used to provide the reader with a quick-to-consult reference.

2 Related work

2.1 Knowledge-base retrieval

Knowledge bases such as Wikipedia and Freebase have been used to automatically improve retrieval effectiveness by augmenting user-issued queries. We start by introducing the method we rely on in this article: the Entity Query Feature Expansion (EQFE) model (Dalton et al. 2014); its actual formulation is detailed in Sect. 3.1. This model performs automated query expansion by linking mentions from the original query to concepts in Wikipedia. Instead of achieving this through a direct mapping (as we later show Bendersky et al. (2012) did), the Entity Query Feature Expansion model labels words in the query and in each document with a set of entity mentions \(M_Q\) and \(M_d\) (Dalton et al. 2014). Each entity mention is related to KB entities \(e \in E\), with different relationship types. Queries are then expanded by including entity aliases, categories, words, and types from their related Wikipedia articles. The expanded queries are then matched against documents in the corpus using the query likelihood model with Dirichlet smoothing.

We posit that this Entity Query Feature Expansion model is a natural fit for consumer health search. It provides a means of mapping health queries to health entities in a health related (subset of a) KB, be this either a general purpose KB (e.g., Wikipedia) or a domain-specific KB (e.g., UMLS). The initial query can then be expanded based on related entities. In this article, we investigated the use of both a specialised health KB, in line with previous work that expanded queries using, e.g., MeSH or UMLS (Soldaini et al. 2016; Díaz-Galiano et al. 2009; Silva and Lopes 2016), and a general purpose KB, Wikipedia. Our rationale for this latter choice was the observation that consumers tend to submit queries using general terms and that these are covered by Wikipedia entities. However, Wikipedia also covers many of the medical entities found in specialised medical KBs. More importantly, there are links between the general and specialised entities in Wikipedia—links that can be exploited for query expansion. For the same reason, we further extended the choices we investigated for KB construction by also considering the consumer health vocabulary (CHV), which, like Wikipedia, links professional lingo and consumer expressions (e.g., “myocardial infarction” \(\Rightarrow \) “heart attack”); unlike Wikipedia, however, CHV encodes this mapping explicitly rather than implicitly. Thus, we adopted the Entity Query Feature Expansion model for our empirical evaluation, determining if such a KB retrieval approach is effective for CHS.

Other methods for knowledge base retrieval do exist: next we provide a brief account of selected methods used for KB retrieval.

For example, Bendersky et al. (2012) proposed a query formulation approach that links queries to concepts in multiple information sources such as Wikipedia, query logs, and the retrieval corpus itself, using pseudo-relevance feedback. First, they weighted concepts from the query by considering the frequency of each concept found in Google N-grams, the MSN query log, Wikipedia titles, and the retrieval corpus. Then, a large pool of candidate expansion terms was built for each information source using pseudo-relevance feedback. Candidate expansion terms in the pool were ranked based on their weight as formulated in the first step. The top 100 terms from each pool were then combined and further ranked using a weighted combination of expansion scores. Finally, only the top K terms from the combined pool were used as expansion terms (\(K \le 10\)).

Balaneshinkordan and Kotov (2016) empirically investigated the effectiveness in adhoc search tasks of query expansion terms derived from the DBpedia, Freebase and ConceptNet knowledge bases, as well as from the actual document collection. Query expansion terms were derived using information theoretic measures (mutual information) and term association approaches [term co-occurrence via the Hyperspace Analogue to Language method (Lund and Burgess 1996)]. These were then interpolated with scores from a Dirichlet language model. They found that term associations derived from KBs often provided the highest effectiveness. Compared to Balaneshinkordan and Kotov (2016), we used the more sophisticated EQFE model to select and combine entities to augment the initial user’s query. We also took a radically different approach for estimating entity mapping and selection, and further explored more choices available when using a KB for query expansion.

Balaneshinkordan and Kotov (2016) found that ConceptNet proved the most effective source of query expansions for general, adhoc tasks. ConceptNet is a KB that represents commonsense knowledge. This is in line with previous work that also found ConceptNet to be a valuable source of expansion terms for adhoc, not domain-specific, searches (Kotov and Zhai 2012). In this article, we have not explored the use of ConceptNet, as the terms and associations captured there do not appear to be relevant for CHS. For example, in ConceptNet, the term “insomnia” is linked to irrelevant, non health-related concepts such as “alternative rock” and “alternative progressive”. When links to health-related concepts do exist, their quality is poor. For example, identified causes of insomnia in ConceptNet are “going to bed”, “coffee” and “surfing the net”.Footnote 2 This is, of course, a very limited account of the causes of insomnia (as identified by the Sleep Foundation).Footnote 3

Xiong and Callan (2015) considered query expansion using Freebase as a KB and, like us, considered the choices involved when setting up systems to do this, including their effectiveness in web search tasks. In contrast, they considered a more limited array of choices: entity mention extraction (akin to our Choice 2) and selection of expansion terms (a choice we do not have, as the EQFE model determines which expansion terms are selected). For each of these two choices, they only explored two variants, while we explore many variations for choices in KB retrieval. Specifically, for entity mention extraction they considered either direct (query) keyword match or object frequency from automatic annotations contained in Google’s FACC1 annotation set. For selection of expansion terms they considered a pseudo-relevance feedback approach (which is somewhat comparable, in spirit, to our analysis of relevance feedback mechanisms—Choice 5) and a supervised classification approach (SVM).

Liu and Fang (2015) developed a method for entity-based retrieval that represents entities in a latent space and computes retrieval scores by mapping document and query entities to this common latent space and comparing their projections. Their approach is an alternative to the EQFE method used in this article—a comparison between the latent entity space of Liu and Fang and EQFE in CHS settings is out of the scope of this article; however, we intend to direct future work towards this comparison.

The query expansion technique we considered in this work, EQFE, applies entity extraction and analysis to the query expansion stage of the retrieval process. Other techniques, instead, use entities throughout the different stages of retrieval (i.e., in both indexing and retrieval). This is the case, for example, of the concept-based IR model Explicit Semantic Analysis (Egozi et al. 2011), which relied on entities represented in Wikipedia to identify suitable indexing and retrieval features. A similar approach to concept/entity-based IR has been followed by methods in the medical domain. For example, Zuccon et al. (2012) used the SNOMED-CT terminology to represent medical entities at indexing and retrieval time. Their method further exploited subsumption (i.e., parent-child) relationships between entities to derive query expansion terms. Koopman et al. (2012), instead, used co-occurrence graphs between entities in the same document for retrieval, also relying on an entity-based indexing and retrieval mechanism. The downside of these methods is that entity indexing is often computationally demanding (e.g., entity extraction and annotation must be run across all documents in the corpus) and thus difficult to scale to large web corpora (such as those used in this article).

2.2 Consumer health search (CHS)

One of the major challenges in CHS is the vocabulary mismatch between people’s query terms and the terms used in high quality health web resources. One source of high quality health related terms is the Unified Medical Language System (UMLS) (Bodenreider 2004). However, UMLS concepts are rarely mentioned in consumer health queries: Keselman et al. (2008) showed that only 8.1% of 4,928,158 n-grams from consumer queries can be exactly matched to UMLS concepts. In this section, we discuss work related to knowledge-base retrieval for CHS.

In contrast, Wikipedia is a crowdsourced, general purpose KB allowing people to promote and describe new concepts or augment existing concepts. While general purpose, Wikipedia contains considerable and detailed health information that has been effectively used in health related information retrieval (Jimmy et al. 2018; Soldaini et al. 2015).

In an earlier study, we evaluated several design choices to instantiate the EQFE model in CHS (Jimmy et al. 2018). These were:

  1. Collect pages with medicine infoboxFootnote 4 typeFootnote 5 (e.g., “abortion method”, “alternative medicine”, “pandemic”);

  2. Collect pages with health infobox type or with links to medical terminologies such as UMLS, Disease DB and ICD in the health infobox;

  3. Collect pages that had at least one UMLS entity mention in their title. Entity extraction was done using QuickUMLS (Soldaini and Goharian 2016).

Previously, Soldaini et al. (2015) utilised Wikipedia to select health related terms from clinical case reports. First, they built a health related Wikipedia KB by collecting pages that contained an infobox with links to medical terminologies, and a non-health related Wikipedia KB containing the remaining pages. Then, they calculated the probability of a term being health related by computing the ratio between the probability of the term being found in the health KB and that of the term being found in the non-health KB. We employed a similar method to limit the terms selected by relevance feedback (RF) processes (either explicit or pseudo RF) (see Sect. 3.2.5).

The probability of a term being health related has also been shown to be an effective criterion for selecting expansion terms for CHS (Soldaini et al. 2016). In that work, medical synonyms were extracted by mapping query terms to three medical KBs (Behavioral, MedSyn, or DBpedia). Then, the synonym with the highest probability of being health related was added to the original query. Finally, a supervised classifier was used to select the most likely synonym for each query. In our study, we further explored features of KBs (beyond synonyms) to improve the effectiveness of CHS queries.

In contrast with Wikipedia, the UMLS is a medical specific knowledge base that contains medical concepts and relationships among concepts (Bodenreider 2004). Its latest 2017 version (i.e., 2017AB) contains approximately 3.64 million concepts compiled from 201 biomedical vocabularies in various languages. Each UMLS concept is grouped into one or more semantic types (out of 133 semantic types in total). As the UMLS is compiled from biomedical vocabularies, it contains many semantic types that are not relevant to CHS, such as amino acid sequence, cell function, and embryonic structure. For this reason, Soldaini et al. (2016) and Limsopatham et al. (2013) decided to include only concepts from 16 semantic types considered related to the four aspects of medical decision criteria: symptom, diagnostic test, diagnosis, and treatment. In our experiments using the UMLS, we followed the same practice.

Using UMLS for CHS still results in vocabulary mismatch between people’s queries and the medical terms in the UMLS (Keselman et al. 2008). To overcome this, the Consumer Health Vocabulary (CHV) (Zeng and Tse 2006; Keselman et al. 2008) was built; this open access resource provides a mapping between consumer health terms and UMLS concepts.

This mapping is constructed by extracting n-grams from MedlinePlus queries and various health-focused bulletin boards; these n-grams are then automatically mapped to the UMLS via exact match comparison. Any un-mapped n-grams are then manually mapped to the UMLS (Keselman et al. 2008). Since 2007, the CHV has been available as part of the UMLS, as entries with “CHV” as their source (i.e., SAB).

Both UMLS and Wikipedia have been used as learning-to-rank (LtR) features for CHS (Soldaini and Goharian 2017). The results showed that, for Wikipedia, the average idf and the tf in health pages were the first and third best LtR features, respectively. For UMLS, the number of matching UMLS concepts in a document, the number of “sign or symptom” concepts found in a document, and the number of “injury or poisoning” concepts found in a document were the second, fifth, and seventh best LtR features, respectively. The best LtR system from Soldaini and Goharian (2017) beat a baseline system by 26.6% on the CLEF2016 dataset (nDCG@10: 0.305 vs nDCG@10: 0.241). This is the same dataset used in this article; thus, we used the results of their study as a benchmark.

In this study, we posit that Wikipedia, UMLS, and CHV have the potential to improve consumer health search. We evaluated the effectiveness of various CHS design choices using these three KBs.

3 Methodology

3.1 Expansion model

Fig. 1 Summary of expansion sources

We implemented the Entity Query Feature Expansion (EQFE) model for retrieval using Wikipedia, UMLS, and CHV as KBs. The EQFE model aims to enrich a query with features from KB entities that are linked to the query. For the Wikipedia KB, a single entity is represented by a single Wikipedia page (the page title identifies the entity). Beyond the title, Wikipedia pages contain many features useful in a retrieval scenario: entity title (E), categories (C), links (L), aliases (A), and body (B). As for the UMLS and CHV KBs, a single entity is represented by the most frequently used term for a single concept unique identifier (CUI). Features of a UMLS or CHV entity are aliases (A), body (B), parent concepts (P), and related concepts (R). Figure 1 shows the features we used for mapping the queries to entities in the KB and as the source of expansion terms. We formally define the query expansion model as:

$$\begin{aligned} \hat{\vartheta }_q = \sum _{M}^{} \sum _{f}^{} \lambda _f \vartheta _{f(EM, SE)} \end{aligned}$$
(1)

where M is the set of entity mentions, containing the uni-, bi-, and tri-grams generated from the query; f is a function used to extract the expansion terms; \(\lambda _f \in (0,1)\) is a weighting factor; and \(\vartheta _{f(EM, SE)}\) is a function that maps an entity mention to the KB features EM (e.g., “Title”, “Aliases”, “Links”, “Body”) and extracts expansion terms from the source of expansion SE (e.g., “Title”, “Aliases”).
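To make the model concrete, the following minimal Python sketch shows one way Eq. 1 can be instantiated. The KB access interface (kb.lookup) and the fixed weight are illustrative assumptions, not the exact implementation used in our experiments:

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Generate the uni-, bi- and tri-gram entity mentions M from a query."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def expand_query(query, kb, mapping_feature="aliases",
                 expansion_feature="title", weight=0.5):
    """Sketch of Eq. 1: accumulate expansion-term weights over mentions and features.

    kb.lookup(mention, feature) is a hypothetical helper returning the KB
    entities whose `feature` exactly matches the mention; entity[expansion_feature]
    returns the text stored under that feature (e.g. the entity title).
    """
    theta = Counter()                      # expanded query representation
    for mention in ngrams(query.lower().split()):
        for entity in kb.lookup(mention, mapping_feature):
            for term in entity[expansion_feature].split():
                theta[term] += weight      # plays the role of lambda_f in Eq. 1
    for term in query.lower().split():     # keep the original query terms
        theta[term] += 1.0
    return theta
```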

3.2 Choices in knowledge base retrieval

This section describes the choices that we considered for each component of the EQFE pipeline (Fig. 2). To select the expansion terms, first, we constructed a number of knowledge bases (KBs); each KB contains features such as title, aliases, etc. Second, we extracted entities from the original queries. Third, we mapped the query entities to entities in each KB by exact matching each query entity against each KB feature. Fourth, we sourced expansion terms from the features of the mapped KB entities. Fifth, we performed relevance feedback with the aim of further improving the already expanded queries. The remainder of this section describes these choices in detail.

Fig. 2 The EQFE pipeline we considered in this article when instantiating this model. In this model, q is the original query, q’ is an expanded query, Exp denotes the expansion terms, and q” is a query expanded with (pseudo-) relevance feedback (p(rf)), after the original query was augmented using query expansion

3.2.1 Choice 1: knowledge base construction

We investigated which entities should form the basis of our KB. The CHS focus meant that health-related entities were needed. For the Wikipedia KB, we considered four Wikipedia Construction (WC) choices for collecting health related pages:

WC-All: all Wikipedia pages;

WC-Type: pages with Medicine infoboxFootnote 6 typeFootnote 7 (e.g., “abortion method”, “alternative medicine”, “pandemic”);

WC-TypeLinks: pages with Medicine infobox type and pages with infobox containing links to medical terminologies such as MeSH, UMLS, SNOMED CT, ICD;

WC-UMLS: pages with title matching an UMLS entity.

The last method used QuickUMLS (Soldaini and Goharian 2016) to map Wikipedia page titles to the UMLS: if the mapping was successful, we included the Wikipedia entity (page) in the KB.
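As an illustration, the WC-UMLS filter can be approximated with the QuickUMLS Python package roughly as follows; the installation path and matching threshold shown are assumptions that depend on the local QuickUMLS index:

```python
from quickumls import QuickUMLS

# Path to a local QuickUMLS installation (an assumption of this sketch);
# the matching threshold is also illustrative.
matcher = QuickUMLS("/path/to/quickumls/index", threshold=0.9)

def title_maps_to_umls(page_title):
    """WC-UMLS: keep a Wikipedia page only if its title maps to a UMLS concept."""
    return len(matcher.match(page_title, best_match=True)) > 0

def build_wc_umls(pages):
    """Filter (title, page) pairs down to the WC-UMLS knowledge base."""
    return [(title, page) for title, page in pages if title_maps_to_umls(title)]
```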

For UMLS and CHV KBs, we considered the following UMLS Construction (UC) and CHV Construction (CC) choices:

UC/CC-All: all entities;

UC/CC-Med: entities related to four key aspects of medical decision criteria (i.e., symptoms, diagnostic test, diagnoses, and treatments) as used in (Limsopatham et al. 2013; Soldaini et al. 2016).

For these choices, we included all English and non-obsolete terms.

Fig. 3 Extracting entity mentions from the query “natural cures for lifelong insomnia”: the influence of different choices for entity extraction (Choice 2)

3.2.2 Choice 2: entity mention extraction

Entity mention extraction is the process of identifying spans of text in the query that could map to some entity; it does not determine which exact entity a span maps to (this is detailed in the next section). We considered four possible Mention Extraction (ME) choices to extract entity mentions (see Fig. 3):

ME-All: include all uni-, bi- and tri-grams of the query (default choice);

ME-CHV: include only those uni-, bi- and tri-grams of the query that matched entities in the Consumer Health Vocabulary (CHV) (Keselman et al. 2006);Footnote 8

ME-UMLS: include only those uni-, bi- and tri-grams of the query that matched entities in the UMLS (via QuickUMLS);

ME-MetaMap: include only those uni-, bi- and tri-grams of the query that matched health entities via MetaMap (Aronson and Lang 2010).

These choices were used for all KBs. For ME-CHV, we used the CHV version included in the UMLS version 2017AB [while in our previous work we used CHV version 20110204 (Jimmy et al. 2018)].
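A minimal sketch of the n-gram based mention extraction choices follows, assuming a pre-built set of surface forms (CHV or UMLS) is available for the filtered variants; ME-MetaMap, which calls the external MetaMap system, is omitted:

```python
def query_ngrams(query, n_max=3):
    """All uni-, bi- and tri-grams of the query (ME-All)."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def extract_mentions(query, vocabulary=None):
    """ME-All when vocabulary is None; ME-CHV / ME-UMLS when vocabulary is the
    (assumed pre-built) set of surface forms of the corresponding terminology."""
    mentions = query_ngrams(query)
    if vocabulary is None:
        return mentions
    return [m for m in mentions if m in vocabulary]

# Toy example for the query of Fig. 3; the vocabulary here is illustrative only.
print(extract_mentions("natural cures for lifelong insomnia",
                       vocabulary={"insomnia", "natural"}))
```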

3.2.3 Choice 3: entity mapping

We investigated how the entity mentions from the previous section were mapped to entities in the KB. An entity mention was mapped to an entity if an exact match was found between the mention and the entity. As shown in Fig. 1, the Wikipedia entity can be represented according to five different features. The Wikipedia Entity Mapping (WEM) choices considered were:

WEM-Title: titles;

WEM-Aliases: aliases;

WEM-Links: links;

WEM-Body: the entire bodies of the Wikipedia pages;

WEM-Cat: categories;

WEM-All: all the previous sources (default choice).

For UMLS and CHV KBs, the UMLS Entity Mapping (UEM) and CHV Entity Mapping (CEM) choices considered were:

UEM/CEM-Title: titles;

UEM/CEM-Aliases: aliases;

UEM/CEM-Body: the entire UMLS concept description;

UEM/CEM-Parent: parents;

UEM/CEM-Related: related entities;

UEM/CEM-All: all the previous sources (default choice);

UEM/CEM-QuickUmls: use QuickUMLS to obtain entity mappings.

Table 1 shows the mappings to the Aliases feature of each KB for the query “abdominal pain, vomiting, pain near belly button, duplicated ureter”.
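The exact-match mapping itself can be sketched as follows, assuming each KB entity is stored as a dictionary of feature values; selecting all feature names emulates the WEM-All/UEM-All/CEM-All variants:

```python
def map_mention(mention, kb_entities, features=("aliases",)):
    """Choice 3 sketch: return KB entities whose selected feature(s) exactly
    match the mention.

    kb_entities is an assumed iterable of dicts such as
    {"title": "...", "aliases": ["...", ...], "body": "...", ...}.
    """
    mention = mention.lower()
    matched = []
    for entity in kb_entities:
        for feature in features:
            values = entity.get(feature, [])
            if isinstance(values, str):
                values = [values]
            if any(mention == value.lower() for value in values):
                matched.append(entity)
                break   # one matching feature is enough for this entity
    return matched
```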

Table 1 Choice 3: Mapped entities for query id 122006: “abdominal pain, vomiting, pain near belly button, duplicated ureter” are mapped to the Aliases feature of each KB
Table 2 Choice 4: Expansion terms selected for each KB when considering different variants for the choice source of expansion. For this example, the initial query was id 103004: “headaches caused by too much blood or “high blood pressure””

3.2.4 Choice 4: source of expansion

We investigated which sources in the KB were used to draw candidate terms for query expansion. We explored three Source of Expansion (SE) choices:

SE-Title: titles associated with the entities;

SE-Aliases: aliases associated with the entities;

SE-All: both titles and aliases (default choice).

While other information sources could be used (for example, those used for entity mapping), preliminary experiments showed that only these three choices produced meaningful results. These choices were used for all KBs (Wikipedia, UMLS, and CHV). An example of the different outputs obtained by each variant for this choice is shown in Table 2.

3.2.5 Choice 5: relevance feedback

The unique challenges of CHS make explicit relevance feedback (RF, i.e., where feedback comes from the user) a worthwhile consideration for improving retrieval effectiveness. The question that follows is: what gains are possible if the user provided explicit feedback? To answer this, we applied RF by using the actual relevance labels (qrels) to simulate an accurate user selecting relevant documents. Comparison was made to a non-RF baseline to determine the effective gain from explicit RF. In this study, we investigated the use of relevance feedback (both explicit relevance feedback (RF) and pseudo relevance feedback (PRF)) as used in Jimmy et al. (2018).

We performed RF by extracting the 10 most important health related words (based on tf.idf scores) from each of the top three relevant documents (relevance label greater than 0) thus resulting in a maximum of thirty expansion terms. PRF was performed by extracting the 10 most important health related words from the top three ranked documents (regardless of their true relevance label). A term was considered as health related if it exactly matched a title or an alias of an entity in the target KB: either Wikipedia (WC-TypeLinks) or UMLS (UC-All).
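A minimal sketch of this RF/PRF term extraction, assuming precomputed tf.idf scores and a health-relatedness test against the target KB’s titles and aliases, is:

```python
def feedback_terms(ranked_docs, qrels=None, k_docs=3, k_terms=10):
    """Sketch of the RF / PRF expansion described above.

    ranked_docs is the initial ranking; with qrels (label > 0) this simulates
    explicit RF, without qrels it is PRF. tfidf(term, doc), doc.terms and
    is_health_related(term) (exact match against the target KB's titles and
    aliases) are assumed helpers.
    """
    if qrels is not None:
        docs = [d for d in ranked_docs if qrels.get(d.id, 0) > 0][:k_docs]
    else:
        docs = ranked_docs[:k_docs]
    expansion = []
    for doc in docs:
        ranked_terms = sorted(doc.terms, key=lambda t: tfidf(t, doc), reverse=True)
        health_terms = [t for t in ranked_terms if is_health_related(t)]
        expansion.extend(health_terms[:k_terms])   # at most 10 terms per document
    return expansion
```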

In addition, in this study we also considered the relevance feedback approach proposed by Soldaini et al. (2015). We refer to this approach as RF Health Terms (RFHT) and PRF Health Terms (PRFHT), as it filters the candidate relevance feedback terms based on the probability of a term being health related, computed from likelihoods derived from Wikipedia (see Sect. 2.2).

In PRFHT, all terms in the top k results with high probability of being health-related are extracted and used for query expansion. This probability is calculated as:

$$\begin{aligned} OR(t_j) = \frac{Pr\{P\ {\textit{is}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}}{Pr\{P\ {\textit{is}}\ {\textit{not}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}} \end{aligned}$$
(2)

where P is a Wikipedia page and term \(t_j\) is included in a query if \(OR(t_j) \ge \delta \). In our experiments, we calculated the probabilities of a Wikipedia page P being health related and being not-health related as:

$$\begin{aligned} Pr\{P\ {\textit{is}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}= \frac{|P \in D_h : t_j \in P|}{|D_h|} \end{aligned}$$
(3)
$$\begin{aligned} Pr\{P\ {\textit{is}}\ {\textit{not}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}= \frac{|P \in D_{nh} : t_j \in P|}{|D_{nh}|} \end{aligned}$$
(4)

where \(D_h\) is a collection of Wikipedia pages with health infobox and links to medical terminologies (i.e., WC-TypeLinks) and \(D_{nh}\) contains Wikipedia pages that are not included in \(D_h\). Using the English subset of Wikipedia crawled on the 1/12/2016, we found that \(|D_h| = 13{,}135\) and \(|D_{nh}| = 9{,}182{,}304\).

While Soldaini et al. (2015) suggested that the optimal value for \(\delta \) is 2, in preliminary experiments we found that \(\delta = 2\) is too low, as many non-health terms obtained \(OR(t_j) \ge 2\); in this study, instead, we used \(\delta = 4\) as it was a better fit. This difference was likely due to a different Wikipedia dump being used: ours was substantially larger than that reported by Soldaini et al. Further, to prevent query drift, we limited the number of expansion terms added for PRFHT to 20.
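A sketch of this health-term filter (Eqs. 2–4), assuming the document frequencies over the health and non-health Wikipedia subsets have been precomputed, is:

```python
D_H, D_NH = 13_135, 9_182_304   # |D_h| and |D_nh| from our Wikipedia dump

def odds_ratio(term, df_health, df_non_health):
    """OR(t_j) of Eqs. 2-4; df_* are assumed dicts: term -> document frequency."""
    p_h = df_health.get(term, 0) / D_H
    p_nh = df_non_health.get(term, 0) / D_NH
    if p_nh == 0:
        return float("inf") if p_h > 0 else 0.0
    return p_h / p_nh

def prfht_filter(candidates, df_health, df_non_health, delta=4.0, max_terms=20):
    """Keep candidate feedback terms with OR >= delta, capped to limit query drift."""
    kept = [t for t in candidates
            if odds_ratio(t, df_health, df_non_health) >= delta]
    return kept[:max_terms]
```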

Once terms are filtered to retain only terms estimated to be health related, the j-th health term in document \(D_i\) is weighted according to:

$$\begin{aligned} b_j = \log _{10}(10 + w_j) \end{aligned}$$
(5)

where:

$$\begin{aligned} w_j = \alpha \cdot I_q(t_j) \cdot tf_j + \bigg (\frac{\beta }{k}\bigg ) \cdot \sum _{i=1}^{k} I_{D_i} (t_j) \cdot idf_j \end{aligned}$$
(6)

Following the work of Soldaini et al. (2015), we fixed \(k = 10\), \(\alpha = 2.0 \) and \(\beta = 0.75\). In Eq. 6, \(I_q(t_j) = 1\) if \(t_j \in Q\), and 0 otherwise; \(I_{D_i} (t_j) = 1\) if \(t_j \in D_i\), and 0 otherwise.
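The term weighting of Eqs. 5–6 can be sketched as follows, using the fixed parameter values above; the term statistics and the document/query membership tests are assumed inputs:

```python
import math

def prfht_weight(term, query_terms, top_docs, tf, idf,
                 k=10, alpha=2.0, beta=0.75):
    """b_j of Eq. 5 for one candidate health term (w_j is Eq. 6).

    tf and idf are assumed dicts of term statistics; top_docs is the list of
    the top k feedback documents, each supporting `term in doc`.
    """
    i_q = 1 if term in query_terms else 0
    doc_part = sum(idf[term] for doc in top_docs[:k] if term in doc)
    w = alpha * i_q * tf[term] + (beta / k) * doc_part
    return math.log10(10 + w)
```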

For the explicit relevance feedback variant (RFHT), we modified the above PRFHT approach to only extract terms from the top k explicitly relevant documents. Unlike PRFHT, for RFHT we did not limit the number of expansion terms added: all expansion terms with \(OR(t_j) \ge 4\) were added to the original query.

4 Data collection

To investigate the influence that choices in KB retrieval have on query expansion for the CHS task, we empirically evaluated methods using the CLEF 2016 eHealth collection (Zuccon et al. 2016). This collection comprises 300 query topics originating from health consumers seeking health advice online. Documents are taken from Clueweb12b-13. The collection was indexed using Elasticsearch 5.1.1, with stopping and stemming. A simple baseline was implemented using BM25F with \(b=0.75\) and \(k1=1.2\). BM25F allows specifying boosting factors for matches occurring in different fields of the indexed web page. We considered only the title field and the body field, with boost factors 1 and 3, respectively. These were found to be the optimal weights for BM25F for this test collection in previous work (Jimmy et al. 2016). This is a strong baseline as it outperforms most runs submitted to CLEF 2016.
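For reference, a field-boosted query against such an index can be expressed roughly as follows; the index name and the use of a multi_match query are assumptions of this sketch, and the BM25 parameters (b, k1) would be set in the index similarity settings rather than in the query itself:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local Elasticsearch 5.x instance

def baseline_search(query, index="clueweb12b13", size=10):
    """Field-boosted retrieval: title with boost 1 and body with boost 3."""
    body = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^1", "body^3"],
            }
        },
        "size": size,
    }
    return es.search(index=index, body=body)
```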

For constructing the Wikipedia KB, we considered candidate pages from the English subset of Wikipedia (dump 1/12/2016), limited to current revisions only and without talk or user pages. Of the 17 million entries, we filtered out pages that were redirects; this resulted in a Wikipedia corpus of 9,195,439 pages (i.e., WC-All). These candidate pages were then processed according to the choices available for KB construction (Sect. 3.2.1). The total number of pages included in WC-Type is 9562 pages, in WC-TypeLinks is 13,135 pages, and in WC-UMLS is 1,112,206 pages. Selected pages to be included in the KB were also indexed using Elasticsearch 5.1.1 with field based indexing, to support the use of different fields as the source of query expansion terms (Sect. 3.2.4). For all Wikipedia KBs, we indexed the following fields: title (text node of element node <title>), links (outbound links to other Wikipedia pages), categories (as defined in [[Category:category name]]), types (types of all infoboxes in a page), aliases (text node of element node <title> from the page’s redirects), and body (text node of element node <text>).

For constructing the UMLS KB, we indexed non-obsolete English terms (i.e., UC-All) with the following fields: title (the most frequently used term for a CUI), aliases (all other terms used for the CUI), body (the description of a CUI), parent (title of UMLS entities with relationship type PAR), and related (title of UMLS entities with relationship type RQ and RL). Similar to the Wikipedia KB, we processed these UMLS terms according to the choices for constructing the UMLS KB described in Sect. 3.2.1, obtaining 3,057,234 terms in UC-All and 1,344,941 terms in UC-Med.

The CHV KB was constructed by selecting UMLS KB entries with the UMLS SAB field equal to “CHV”. The CHV KB index structure was identical to the UMLS KB. For the CHV based KB, we obtained 56,350 terms in CC-All and 34,514 terms in CC-Med.

5 Empirical evaluation

Results were evaluated using nDCG@10 and RBP@10 (persistence 0.5, depth 10, also reporting residuals (Res.)), in line with the CLEF 2016 collection, as users in the CHS task tend to primarily examine the first few search results. Additionally, bpref was used as a first attempt to reduce the influence of unjudged documents on evaluation (expanded queries retrieved many more unjudged documents than the baseline). For brevity, a full account of statistically significant differences (pairwise t-test with Bonferroni adjustment and \(\alpha <0.05\)) between results is reported in “Appendix 1: Statistical significance analysis” section. Furthermore, the average number of terms added to the expanded query (\(\overline{|exp|}\)), along with the number of expanded queries, of queries with an RBP@10 gain, and of queries with an RBP@10 loss, were recorded as a triplet \({<}e,g,l{>}\).
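For reference, RBP@10 and its residual can be computed as in the following minimal sketch, where unjudged documents (and ranks below the evaluation depth) contribute to the residual rather than to the score:

```python
def rbp_at_10(ranking, judgments, p=0.5, depth=10):
    """Rank-biased precision with persistence p, plus its residual.

    ranking is a list of document ids; judgments maps id -> relevance label
    (label > 0 counts as relevant). Ids missing from judgments are unjudged
    and contribute to the residual, as does everything below the depth cut-off.
    """
    score = 0.0
    residual = p ** depth              # uncertainty beyond the cut-off
    for rank, doc_id in enumerate(ranking[:depth]):
        weight = (1 - p) * p ** rank
        if doc_id not in judgments:
            residual += weight         # an unjudged document could be relevant
        elif judgments[doc_id] > 0:
            score += weight
    return score, residual
```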

We empirically evaluated the influence each choice had on retrieval effectiveness by examining the choices sequentially. We did this for all KBs, and drew conclusions about which KB best supports CHS at the end. For each choice, we fixed the best setting and used this best setting for the subsequent choice. We determined the best setting firstly based on results (i.e., nDCG@10, bpref, RBP@10) for the all queries set. If no method was clearly best for this set, then we checked results from the high coverage queries set. Lastly, if results from the high coverage queries set were unable to clearly determine which method was best, then we selected the setting with the highest RBP@10 for the all queries set as the best setting (RBP@10 was a primary measure for CLEF 2016). The complete set of results is provided in an online appendix at http://ielab.io/kb-chs, along with all runs and the software source code used.

5.1 Choice 1: knowledge base construction

The effect on retrieval of choices in KB construction is reported in Table 3 (top); results are averaged over all 300 queries in the CLEF 2016 collection.

Table 3 Influence of choices in KB construction for CLEF2016 (Choice 1). Statistical significance differences reported in Table 16
Fig. 4 Unjudged documents among the top 10 retrieved by runs in Table 3 (top)

The results for the Wikipedia KB showed that choice WC-TypeLinks (i.e., pages with health infobox type and links to health terms) led to the highest effectiveness across all measures. For the UMLS KB, UC-All achieved the highest effectiveness on all measures. Lastly, for the CHV KB, CC-Med performed best across all measures. Nevertheless, the baseline performed considerably better than any KB retrieval method.

When further analysing the results, we found that, for a large number of queries, the KB retrieval methods ranked many unjudged documents amongst the top 10, while the baseline had a much lower rate of unjudged documents amongst the top 10. Figure 4 reports the distribution of unjudged documents for each of the configurations considered. This is clearly influencing the results, as demonstrated by the large RBP residuals associated with the KB retrieval methods in Table 3 (top) (compared to the residual of the baseline). Interestingly, if all unjudged documents turned out to be relevant, the RBP@10 of the KB retrieval methods would prove largely superior to that of the baseline (compare the residuals).

We then considered a subset of queries for which, on average across all runs considered for a specific choice, there were at most 2 unjudged documents out of the first 10. This threshold was determined by analysing the number of unjudged documents for the baseline (the baseline does not change, irrespective of the choices), so that the threshold corresponded to 1.5 times the interquartile range above the third quartile (the upper whisker of the box-plot). Note that this produced a different subset of queries for each of the considered choices; however, the subsets had the same average “coverage” with respect to the relevance assessments. We refer to these subsets as the high coverage queries set, and to the set containing all the queries as the all queries set. This subset included 12 queries for Choice 1 (Table 3, bottom). Results showed reduced residuals and reduced gaps between KB retrieval methods and the baseline; this affected trends in effectiveness across the considered choices for the Wikipedia KB.

Results from the Wikipedia KB showed that, for the all queries set, the WC-TypeLinks setting performed best on all three measures. Therefore, although the high coverage queries set showed a different trend, we decided that constructing the Wikipedia KB using the WC-TypeLinks setting was the best option.

Trends in effectiveness for the UMLS KB showed that UC-All consistently performed best in both the all queries set and the high coverage queries set. Therefore, we selected UC-All for the following analyses. Lastly, for the CHV KB, we found that CC-Med performed best for all queries on all three measures. Thus, we selected CC-Med as the best setting for the CHV KB.

Interestingly, the KB constructed with the UC-All choice (which contains many concepts unrelated to the health domain, such as C0030561: Paris, France) performed better than the one constructed with the UC-Med choice (which intuitively would contain more health concepts). As noted in Sect. 4, however, the number of concepts in UC-Med is less than half that of UC-All. It is likely that there exists a better way to filter out non-health related concepts from the UMLS. Based on this, an avenue for future work is the development of an effective method for selecting the subset of the UMLS relevant to CHS queries (i.e., improving the construction of the KB based on the UC-Med setting).

Table 4 Influence of choices in entity mention extraction (Choice 2). Statistical significance differences reported in Table 17

5.2 Choice 2: entity mention extraction

Table 4 (top: 300 queries and bottom: 19 high coverage queries) reports the results obtained when comparing choices for entity mention extraction. For the Wikipedia KB, results from the all queries set (Table 4, top) showed no choice was clearly best. We then looked at the high coverage queries set, where the WME-CHV setting performed best on all measures. Therefore, we selected WME-CHV as the best setting for the Wikipedia KB and used this setting in the following analyses.

For UMLS KB, we found that UME-UMLS performed best for the all queries set for all three measures. Thus, we selected UME-UMLS as the best setting for UMLS KB.

Lastly, for CHV KB, both the all queries set and the high coverage queries set showed no choice was clearly best. Therefore, we selected CME-CHV as the setting for CHV KB as it performed best for RBP@10 in the all queries set.

Table 5 Influence of choices in entity mapping (Choice 3). Statistical significance differences reported in Tables 18 and 19

5.3 Choice 3: entity mapping

Table 5 (top: 300 queries and bottom: 18 high coverage queries) reports the results obtained when comparing choices for entity mapping. For all KBs, mapping entities to Aliases (WEM-Aliases, UEM-Aliases, and CEM-Aliases) clearly outperformed the other approaches on the all queries set. Results for the high coverage queries were mixed. Thus, we selected WEM-Aliases, UEM-Aliases, and CEM-Aliases for the subsequent analyses.

5.4 Choice 4: source of expansion

Table 6 (top: 300 queries and bottom: 129 high coverage queries) reports the results obtained when comparing sources of query expansion. Results clearly showed that selecting titles as the source of expansion (WSE-Title, USE-Title and CSE-Title) was the most effective choice for both the Wikipedia and UMLS KBs. Therefore, we selected WSE-Title, USE-Title, and CSE-Title as the best settings for each corresponding KB.

Then, we investigated the merit of combining expansion terms from the best setting of each KB; e.g., expansion terms for the WikiChv were generated by combining expansion terms from the WSE-Title and CSE-Title settings. In total, we generated four possible combinations: WikiUmlsChv, WikiUmls, WikiChv, and UmlsChv. Results for both the all queries set and the high coverage queries set showed that no choice was clearly best. We then selected WikiChv as the best setting as it returned the highest RBP@10 for the all queries set.

Table 6 Influence of choices in source of expansion (Choice 4). Statistical significance differences reported in Table 20

5.5 Choice 5: relevance feedback

Table 7 (top: 300 queries and bottom: 76 high coverage queries) reports the results obtained with and without relevance feedback. For the all queries set, results for all KBs showed that the addition of relevance feedback filtered based on the likelihood of being health related (RFHT) performed best across all measures. In contrast, the addition of pseudo relevance feedback hurt performance for all KBs (with the exception of baselinePRFHT and CSE-TitlePRFHT, which had a better bpref than the baseline and CSE-Title without pseudo relevance feedback).

Results from the high coverage queries set showed similar patterns, where applying RFHT performed best on all measures. The best settings of all KBs with RFHT performed better across all measures compared to the baseline with RFHT.

Table 7 Influence of choices in relevance feedback (Choice 5). Statistical significance differences reported in Table 21 and  22

6 Analysis and discussion

In summary, from Table 7, we highlight the following observations:

  • PRF harmed effectiveness, independent of the KB and of the PRF approach used (including the PRFHT method). While both PRF and PRFHT selected only the top ranked health terms, not all health terms in the top ranked documents were related to the query. For example, the results retrieved by query “lay down cough” (query number 104003) contained many terms related to “coughing”, such as “flu”. While “cough” might relate to flu, pages discussing flu may not necessarily be relevant to the original query. Hence, we found that performing PRF(HT) on expanded queries resulted in query drift, and generated results with higher residuals compared to methods without PRF(HT). Nevertheless, after residuals were reduced through the use of condensed lists (judged documents only, see Sect. 6.2.2 for the results), queries with PRF(HT) generally performed better than without PRF(HT).

  • RF, instead, did provide improved effectiveness, independently of the RF approach, the KB used or the query set (high coverage or all queries).

  • Both PRFHT and RFHT, which used the likelihood of expansion terms to be health related, performed generally better for all measures compared to simple PRF or RF.

  • When using the all queries set and no relevance feedback, the combination of expansion terms from both Wikipedia and CHV (WikiChv) performed best (on all measures). The only exception was the baseline’s nDCG@10 score, which was higher. This was likely because the results obtained with WikiChv contained a higher number of unjudged documents compared to the baseline. This highlights that combining expansion terms from multiple KBs did improve the original CHS queries.

  • For the high coverage queries set, expanded queries with no relevance feedback performed better than the baseline for all measures (see Table 6 (bottom)). This suggests that each KB could be used to effectively expand CHS queries. Overall, the best settings from CHV (CSE-Title) outperformed the best settings from the other KBs.

  • For the high coverage queries, independently of relevance feedback, the best setting for all KBs generated a higher number of queries with an effectiveness gain than with a loss (see Table 7 (bottom)). In fact, in these cases the gains (losses) are WSE-Title: 52.38% (38.10%), USE-Title: 47.54% (22.95%), CSE-Title: 58.33% (27.78%), and WikiChv: 54.76% (33.33%). When relevance feedback is considered (and in particular, the best feedback technique is used, i.e. RFHT), the gains (losses) become: WSE-TitleRFHT: 68.42% (22.37%), USE-TitleRFHT: 69.74% (21.05%), CSE-TitleRFHT: 68.42% (23.68%), and WikiChvRFHT: 67.11% (23.68%).

To contextualise the results obtained by the KB retrieval methods, in Table 7, we also reported the results of the method implemented by the GUIR-3 submission to the CLEF 2016 challenge (Soldaini et al. 2016). This was the best performing, comparableFootnote 9 query expansion method at CLEF 2016. The method expands queries by mapping query entities to the UMLS, then navigating the UMLS tree to gather hypernyms from mapped entities as the source of expansion. Post-processing is applied to prune entities unlikely to benefit retrieval. For each query, multiple expanded query variations are collected and their results aggregated using the Borda algorithm (see Soldaini et al. (2016) for details). Unlike the original method, our implementation relied on BM25F rather than DFR as the scoring method and QuickUMLS in place of MetaMap as the entity extraction method, so as to be directly comparable with our baseline and KB retrieval methods. In Table 7, we do not report \(\overline{|exp|}\) for GUIR-3 as the method replaces some of the original terms with the expansions, thus making comparisons non-trivial.

While Jimmy et al. (2018) suggested that shorter expansions are likely to be more effective, in this study we found that this is not necessarily true. Table 7 shows that the combination of the Wikipedia and CHV based KBs (WikiChv) added more expansion terms on average and performed better than the best settings from either the Wikipedia or the CHV based KB. Furthermore, Table 7 also shows that PRFHT and RFHT generate significantly more expansion terms and yet are more effective than the PRF and RF approaches.

Fig. 5 Changes in RBP@10 between the Entity Query Feature Expansion model utilising the best settings versus the baseline. Only high coverage queries are reported

Overall results can hide some underlying trends, so we also analysed the impact of query expansion on a per-query basis. Figure 5 shows the gains/losses versus the baseline obtained by the best settings of the Wikipedia KB (WSE-TitleRFHT), UMLS KB (USE-TitleRFHT), CHV KB (CSE-TitleRFHT), and the combination of the Wikipedia and CHV KBs (WikiChvRFHT). The magnitudes of these changes are shown in the figure. These improvements (or losses) were measured using RBP@10, and thus expanded queries with low coverage are unlikely to perform as effectively as expanded queries with high coverage. Gains and losses were similar for the different KBs; i.e., for a given query, the gain or loss was similar irrespective of the KB. Only 5 out of the 76 high coverage queries did not exhibit this trend.

Table 8 Performance gain/loss from expanded queries where RBP@10 gains were found in one or more KB, but losses were found in the other KBs

Next, we investigated the queries that were expanded with terms from all KBs without relevance feedback (WSE-Title, USE-Title, CSE-Title, and WikiChv). To do so, we analysed results for the high coverage queries in Choice 4 (Table 6 (bottom)) and found that, of the 129 high coverage queries, 12 queries were expanded by all of the four best settings (see Table 9). The small number of overlapping expanded queries from the four best settings suggests that each best setting mostly targeted different queries. Table 9 shows similar patterns to Table 8, where gains and losses were similar for the different KBs.

Table 9 Performance gain/loss from high coverage queries in Table 6 (bottom). Only queries that are expanded by all four best settings (WSE-Title, USE-Title, CSE-Title, and WikiChv) are reported

Then, we investigated the 3 queries from Table 9 where mixed results were obtained across the different KBs (i.e. not all KBs consistently provided a gain (loss) for the query)—these were queries 131002, 101001, and 147001. Table 10 shows that the terms added to each of the 3 queries largely differed depending on the KB used. Interestingly, Wikipedia, although being a general purpose KB, produced more relevant health expansion terms than the specialised health KBs (i.e., UMLS and CHV). Nevertheless, we also found that the coverage of the Wikipedia KB was limited compared to that of the UMLS and CHV KBs. In fact, Table 6 (top) shows that the best setting that used the Wikipedia KB (WSE-Title) only expanded 76 queries compared to 217 and 155 queries expanded by the best settings used for the UMLS and CHV KBs. This limitation of Wikipedia may be expected as the Wikipedia KB used in this study (WC-TypeLinks) contained only 13,135 pages—this is orders of magnitude smaller than the UMLS KB (UC-All) and CHV KB (CC-Med), which contained 3,057,234 and 1,344,941 terms, respectively.

Table 10 Terms added to queries 131002: “penis lymphocytic infiltration marked nuclear crush artifact”, 101001: “inguinal hernia repair laparoscopic mesh benefits risks”, and 147001: “throat infection sore throat irritated eyes treatment options”
Table 11 The rate of overlap between expansion terms added from KB i with expansion terms added from KB j. For example, 3.5% of expansion terms from the UMLS are found in expansion terms from Wikipedia.

Finally, we investigated how the expansion terms from each KB differ from each other. Table 11 shows the overlap rate among expansion terms from the best settings of all KBs. As expected, all expansion terms from the Wikipedia and CHV KBs were found within the expansion terms from WikiChv. These results also further confirmed that the coverage of the Wikipedia KB was lower compared to that of the UMLS and CHV KBs. Only 3.5% of UMLS and 7.6% of CHV expansion terms were found in Wikipedia. On the other hand, 19.2% and 20.2% of expansion terms from Wikipedia were found within expansion terms from the UMLS and CHV, respectively. Finally, these results also show that each KB promoted mostly different expansion terms.

6.1 Generalisability of the best settings

We have shown that the best settings of query expansion based on Wikipedia, UMLS, CHV, or the combination of Wikipedia and CHV to form the KB, were able to improve retrieval effectiveness, compared to the original CHS queries. We did so by empirically exploring different KB retrieval settings throughout 5 choices, and selecting the best configuration for each choice. Next, we aimed to validate our findings by verifying whether they apply to a different sample of the web and a different set of CHS queries.

To this aim, we applied the best settings we obtained on the CLEF 2016 collection to the CLEF 2015 collection. This collection contains 66 queries and a corpus of more than 1 million web pages, sampled from health related websites (rather than a general web sample, as in CLEF 2016, i.e. ClueWeb12 B13). Table 12 reports the results obtained when applying the best settings for Wikipedia, UMLS, CHV, and the combination of Wikipedia and CHV to the CLEF 2015 collection. The results showed that:

  • Independently of the KB, RFHT exhibited improvement, but PRFHT did not. These findings were in line with those from CLEF2016.

  • For the all queries set, without relevance feedback, expanded queries from WSE-Title, CSE-Title, and WikiChv provided gains over the baseline for bpref and RBP@10. However, apart from WSE-Title, the expansion methods performed worse than the baseline on nDCG@10.

  • For the high coverage queries set, without relevance feedback, the best settings for CHV (CSE-Title) and for the combination of Wikipedia and CHV (WikiChv) performed better than the baseline for all measures.

Table 12 Performance of the CLEF 2016 best settings on the CLEF 2015 query set. Statistical significance differences reported in Table 23

In summary, the above findings show that the settings found to perform best on CLEF 2016 did translate to the CLEF 2015 collection.

6.2 Mitigating problems with unjudged documents

The analysis of residuals for expanded queries (top part of Tables 3, 4, 5, 6, 7), along with the analysis in Fig. 4, indicated that the baseline had far fewer unjudged documents amongst the top 10 results than the EQFE method. We treated unjudged documents as not relevant; however, given the shallow pools at CLEF 2016, and the fact that the method investigated here did not contribute to the pool (and is substantially different from those that did), there is the possibility that a significant portion of the unjudged documents were, in fact, relevant. To account for this in our analysis of results, along with reporting RBP residuals, we also used bpref (which only considers assessed documents) and further considered the high coverage queries sub-set for each result set (bottom part of Tables 3, 4, 5, 6, 7).

Next, we further analyse our results with respect to unjudged documents, by (1) using the additional relevance assessments made available for this collection in CLEF 2017 (Palotti et al. 2017), and (2) using condensed list evaluation measures (Sakai 2007).


Submission to CLEF 2017


We submitted results from our previous work (Jimmy et al. 2017) to the CLEF 2017 e-Health IR Task 1 (Palotti et al. 2017). In CLEF 2017, the topics from 2016, which we considered in our experiments, were re-used to obtain a deeper and more varied assessment pool. We thus further applied this new set of assessments to study the choices in knowledge based retrieval considered here. Table 13 reports the effectiveness of all expanded queries for Choice 5, using the combined relevance assessments from CLEF 2016 and 2017.

For the all queries set, the top part of Table 13 shows that queries expanded using any of the KBs studied here and without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline, on all measures, with the exception of WSE-Title (worse nDCG@10) and USE-Title (worse nDCG@10 and RBP@10).

While the additional assessments from CLEF 2017 reduced the number of unjudged documents retrieved using expanded queries, we found that residuals from all expanded queries were consistently higher than the residual from the baseline query (see Fig. 6, for which we used the combined CLEF 2016 and 2017 relevance assessments).

We thus turn to analyse the results for the high coverage queries (Table 13, bottom part). For this set, the expanded queries based on any KB and without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline on all measures, with the exception of USE-Title, which had a lower nDCG@10. Overall, the results from the combined CLEF 2016 and 2017 assessments confirmed our findings as summarised at the beginning of Sect. 6.

Table 13 Influence of choices in KB construction for Choice 5 using the combined CLEF 2016 and 2017 relevance assessments (compare with results from Table 7, where only CLEF 2016 assessments were used). Statistical significance analysis is reported in Tables 24 and 25
Fig. 6 Unjudged documents among the top 10 retrieved by runs in Table 13 (top)


Condensed list evaluation


Sakai (2007) suggested computing evaluation measures such as nDCG or average precision on condensed lists, i.e., document rankings obtained by considering only judged documents, as an alternative to bpref for dealing with retrieval results hampered by unjudged documents. We followed this approach to further analyse the results. In Table 14 we report the performance of queries expanded with and without relevance feedback, using condensed list evaluation for precision at 10 (P@10), mean average precision (MAP), nDCG@10 and RBP@10 (for brevity, statistically significant differences are reported in Table 26). Condensed list results suggest that queries expanded with any KB without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline on all measures. Any relevance feedback method (RF, PRF, RFHT, or PRFHT) could further improve retrieval effectiveness on all measures, with the only exception of applying RF and PRF to WSE-Title, which obtained a lower MAP than when used without relevance feedback (i.e., \(\hbox {WSE-Title} > \hbox {WSE-TitleRF}\), WSE-TitlePRF).
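A condensed list is obtained simply by removing unjudged documents from a ranking before computing the measures, for example:

```python
def condensed_list(ranking, judgments):
    """Drop unjudged documents (Sakai 2007) before computing P@10, MAP, nDCG@10 or RBP@10."""
    return [doc_id for doc_id in ranking if doc_id in judgments]

# Measures are then computed on condensed_list(run, qrels) instead of the raw run.
```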

Table 14 Performance of expanded queries with and without relevance feedback, using condensed list evaluation. Statistical significance differences reported in Table 26

7 Conclusions

In this paper, we explored the influence of different choices in knowledge base (KB) retrieval for consumer health search (CHS). Choices included KB construction, entity mention extraction, entity mapping, source of expansion, and relevance feedback. We compared the effectiveness of a general KB (Wikipedia), a medical specialised KB (UMLS) and a consumer health vocabulary (CHV) as the basis for query expansion.

Table 15 Summary of Table 7 comparing results from the baseline and those from the best settings of each KB for all queries set

Our empirical evaluation (as summarised in Table 15) showed that the best settings for the Wikipedia KB are:

  1. Index only Wikipedia pages that have health related infobox types or links to medical terminologies.

  2. Use uni-, bi-, and tri-grams of the original queries that matched CHV terms as entity mentions.

  3. Map entity mentions to Wikipedia entities based on the Aliases feature.

  4. Source expansion terms from the mapped Wikipedia page Title.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

As for the UMLS KB, the best settings are:

  1. Index all UMLS concepts.

  2. Use uni-, bi-, and tri-grams of the original queries that matched UMLS terms as entity mentions.

  3. Map entity mentions to UMLS entities based on the Aliases feature.

  4. Source expansion terms from the mapped UMLS Title feature.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

For the CHV KB, the best settings are:

  1. Index all CHV concepts that are related to the four key aspects of medical decision criteria.

  2. Use uni-, bi-, and tri-grams of the original queries that matched CHV terms as entity mentions.

  3. Map entity mentions to CHV entities based on the Aliases feature.

  4. Source expansion terms from the mapped CHV Title feature.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

Finally, the best combined settings are:

  1. Combine expansion terms from the best settings of Wikipedia and CHV (WikiChv).

  2. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

Our empirical evaluation shows that, overall, combining expansion terms from the best settings of Wikipedia and CHV (WikiChv) was more effective than using expansion terms from the best settings of any individual KB. Using expansion terms from the combined KBs (WikiChv) improved upon the baseline in both bpref (+ 8.7%) and RBP@10 (+ 1.1%) when using the full query set and without relevance feedback. For high coverage queries, improvements were observed for nDCG@10 (+ 5.7%), bpref (+ 5.7%), and RBP@10 (+ 12.3%). While the best results were observed using the combined WikiChv KB, the use of each individual KB also resulted in improvements over the baseline on high coverage queries. These findings demonstrate the merit of a knowledge-base retrieval approach in the challenging CHS domain.

The use of relevance feedback with filtering of health related query terms further improved results. For the full query set, expansion with a combined WikiChvRFHT KB improved considerably compared to the baseline: nDCG@10 (+ 51.8%), bpref (+ 29.5%), and RBP@10 (+ 98.2%). For high coverage queries, similar improvements were observed: nDCG@10 (+ 53%), bpref (+ 24.2%), and RBP@10 (+ 82.5%).

The major limitation of our experiments was the number of unjudged documents retrieved using the expanded queries on the CLEF 2016 collection. We addressed this limitation in different ways. When reporting the RBP results, we also reported the residuals: these provide an intuition of how much RBP could be under-estimated because of treating unjudged documents as not relevant. For each set of experiments, we also considered a subset of queries for which a larger portion of assessed documents were retrieved by all approaches. We further augmented the set of assessed documents from CLEF 2016 with the relevance assessments for the same queries made available as part of CLEF 2017. This evaluation further confirmed the findings obtained when considering only the CLEF 2016 assessments. Finally, we also analysed the retrieval results with respect to a condensed list based evaluation (i.e., by considering only judged documents). The condensed list evaluation confirmed our findings that expanded queries, with or without (pseudo) relevance feedback, from all KBs performed better than the baseline. Yet, it remains challenging to fairly evaluate the methods, because of the number of relevance assessments available in the collection. Nevertheless, this work provides an extended investigation into the choices in KB retrieval for CHS, highlighting both what worked and what did not.