1 Introduction

A major challenge for users in consumer health search (CHS) is how to effectively represent complex and ambiguous information needs as a query (Zhang 2014; Toms and Latter 2007; Zeng et al. 2002). Studies on query formulation in CHS have shown that consumers struggle to find effective query terms (Zeng et al. 2002), often submitting layman and circumlocutory descriptions of symptoms instead of precise medical terms (Stanton et al. 2014; Zuccon et al. 2015). For example, people search for “skin irregularities” instead of “skin lesions” (the correct medical term for the symptom). They do so using general web search engines, which are commonly preferred over specialised health web sites and services (Fox and Duggan 2013; McDaid and Park 2011). However, previous work has shown that the use of general web search engines for answering these specific health needs leads to poor retrieval effectiveness, incorrect information and possibly low user satisfaction (Zuccon et al. 2015). Different approaches have been proposed to improve CHS, including query suggestion (Zeng et al. 2006), learning-to-rank using syntactic, semantic or readability features (Soldaini and Goharian 2017; Palotti et al. 2016), and query expansion or reformulation (Soldaini et al. 2016; Silva and Lopes 2016; Plovnick and Zeng 2004).

Here we focus on overcoming the problems in CHS by expanding a health query with more effective terms (e.g., less ambiguous terms or medical synonyms). For example, the query “skin tag” can be expanded by adding the term “acrochordon”, the medical term for skin tag. The term “acrochordon” provides better disambiguation as it effectively represents the original two-term query. Documents containing the term “acrochordon” are more likely to be relevant to the query than documents containing either “skin” or “tag” alone.

A valuable source of medical domain knowledge is contained in carefully curated medical knowledge bases (KBs); for example, the UMLS medical thesaurus.Footnote 1 Manually replacing query terms with those from medical knowledge bases has proven effective (Plovnick and Zeng 2004)—but can it be done automatically?

Effectively utilising a KB to improve retrieval involves a large number of important design decisions. The impact of these different decisions has not been thoroughly and rigorously considered in most previous approaches (Bendersky et al. 2012; Dalton et al. 2014). Thus, in this paper, we also seek to empirically evaluate the impact of a number of different choices in KB retrieval.


Key contributions

  • The implementation and evaluation of a state-of-the-art knowledge base retrieval method for consumer health search;

  • The impact of implementation choices, including: (i) KB construction; (ii) entity mention extraction; (iii) entity mapping; (iv) source of expansion; (v) use of relevance feedback. We also determine whether a specialised KB is preferable to a general purpose one, or vice versa.

While some of this material is covered in an existing study (Jimmy et al. 2018), this article includes the following additional contributions:

  • An extended literature review highlighting key works that have proposed methods to exploit knowledge bases and knowledge graphs for query expansion, both within and outside health search.

  • An expanded explanation of the methods by integrating a meaningful example that aids the understanding of the key differences produced by each considered choice in the KB query expansion process.

  • The addition of the Consumer Health Vocabulary (CHV) as another knowledge base (Choice 1). CHV provides a mapping between professional medical lingo and consumer expressions (Zeng and Tse 2006; Keselman et al. 2008).

  • The extraction of query entity mentions (Choice 2) using MetaMap (Aronson and Lang 2010), a biomedical information extraction system.

  • A study of combining expansion terms from all KBs (Wikipedia, UMLS and CHV) when considering the source of expansion for term selection (Choice 4).

  • An evaluation of an alternative approach for relevance feedback and pseudo relevance feedback (Choice 5) based on Soldaini et al. (2015)’s work, which filters expansion terms based on their likelihood of being health related.

  • An investigation of the generalisability of the results via evaluation on an additional test collection, CLEF eHealth 2015, which uses different queries and a different web crawl.

  • An analysis of the influence of unjudged documents on retrieval results, including evaluation using the combined relevance assessments from CLEF 2016 and CLEF 2017, and using the condensed list approach (Sakai 2007).

The remainder of this paper is structured as follows. Section 2 discusses previous work related to this article. Section 3 describes the query expansion model used and the choices we consider for knowledge base retrieval. Section 4 explains the data collection used in this work. Section 5 details the empirical evaluation performed and the evaluation results. Section 6 analyses and discusses the evaluation results, while Sect. 7 concludes this article. Additionally, “Appendix 1: Statistical significance analysis” section reports the statistical significance analysis for all the results of the experiments discussed in this article, and “Appendix 2: List of abbreviations” section lists the abbreviations used to provide the reader with a quick-to-consult reference.

2 Related work

2.1 Knowledge-base retrieval

Knowledge bases such as Wikipedia and Freebase have been used to automatically improve retrieval effectiveness by augmenting user-issued queries. We start by introducing the method we rely on in this article: the Entity Query Feature Expansion (EQFE) model (Dalton et al. 2014); its actual formulation is detailed in Sect. 3.1. This model performs automated query expansion by linking mentions from the original query to concepts in Wikipedia. Instead of achieving this through a direct mapping (as we later show Bendersky et al. (2012) did), the Entity Query Feature Expansion model labels words in the query and in each document with a set of entity mentions \(M_Q\) and \(M_d\) (Dalton et al. 2014). Each entity mention is related to KB entities \(e \in E\), with different relationship types. Queries are then expanded by including entity aliases, categories, words, and types from their related Wikipedia articles. The expanded queries are then matched against documents in the corpus using the query likelihood model with Dirichlet smoothing.

We posit that this Entity Query Feature Expansion model is a natural fit for consumer health search. It provides a means of mapping health queries to health entities in a health related (subset of a) KB, be this either a general purpose KB (e.g., Wikipedia) or a domain-specific KB (e.g., UMLS). The initial query can then be expanded based on related entities. In this article, we investigated the use of both a specialised health KB, in line with previous work that expanded queries using, e.g., MeSH or UMLS (Soldaini et al. 2016; Díaz-Galiano et al. 2009; Silva and Lopes 2016), and a general purpose KB, Wikipedia. Our rationale for this latter choice was the observation that consumers tend to submit queries using general terms and that these are covered by Wikipedia entities. However, Wikipedia also covers many of the medical entities found in specialised medical KBs. More importantly, there are links between the general and specialised entities in Wikipedia—links that can be exploited for query expansion. For the same reason, we further extended the choices we investigated for KB construction by also considering the consumer health vocabulary (CHV), which, like Wikipedia, links professional lingo and consumer expressions (e.g., “myocardial infarction” \(\Rightarrow \) “heart attack”); unlike Wikipedia, however, CHV encodes this mapping explicitly rather than implicitly. Thus, we adopted the Entity Query Feature Expansion model for our empirical evaluation, determining if such a KB retrieval approach is effective for CHS.

Other methods for knowledge base retrieval do exist: next we provide a brief account of selected methods used for KB retrieval.

For example, Bendersky et al. (2012) proposed a query formulation approach that links queries to concepts in multiple information sources such as Wikipedia, query logs, and the retrieval corpus itself, using pseudo-relevance feedback. First, they weighted concepts from the query by considering the frequency of each concept found in Google N-grams, the MSN query log, Wikipedia titles, and the retrieval corpus. Then, a large pool of candidate expansion terms was built for each information source using pseudo-relevance feedback. Candidate expansion terms in the pool were ranked based on their weight as formulated in the first step. The top 100 terms from each pool were then combined and further ranked using a weighted combination of expansion scores. Finally, only the top K terms from the combined pool were used as expansion terms (\(K \le 10\)).

Balaneshinkordan and Kotov (2016) empirically investigated the effectiveness in adhoc search tasks of query expansion terms derived from the DBpedia, Freebase and ConceptNet knowledge bases, as well as from the actual document collection. Query expansion terms were derived using information theoretic measures (mutual information) and term association approaches [term co-occurrence via the Hyperspace Analogue to Language method (Lund and Burgess 1996)]. These were then interpolated with scores from a Dirichlet language model. They found that term associations derived from KBs often provided the highest effectiveness. Compared to Balaneshinkordan and Kotov (2016), we used the more sophisticated EQFE model to select and combine entities to augment the initial user’s query. We also took a radically different approach for estimating entity mapping and selection, and further explored more choices available when using a KB for query expansion.

Balaneshinkordan and Kotov (2016) found that ConceptNet proved the most effective source of query expansions for general, adhoc tasks. ConceptNet is a KB that represents commonsense knowledge. This is in line with previous work that also found ConceptNet to be a valuable source of expansion terms for adhoc, not domain-specific, searches (Kotov and Zhai 2012). In this article, we have not explored the use of ConceptNet, as the terms and associations captured there do not appear to be relevant for CHS. For example, in ConceptNet, the term “insomnia” is linked to irrelevant, non health-related concepts such as “alternative rock” and “alternative progressive”. When links to health-related concepts do exist, their quality is poor. For example, identified causes of insomnia in ConceptNet are “going to bed”, “coffee” and “surfing the net”.Footnote 2 This is, of course, a very limited account of the causes of insomnia (as identified by the Sleep Foundation).Footnote 3

Xiong and Callan (2015) considered query expansion using Freebase as a KB and, like us, considered the choices involved when setting up systems to do this, including their effectiveness in web search tasks. In contrast, they considered a more limited array of choices: entity mention extraction (akin to our Choice 2) and selection of expansion terms (a choice we do not have, as the EQFE model determines which expansion terms are selected). For each of these two choices, they only explored two variants, while we explore many variations for choices in KB retrieval. Specifically, for entity mention extraction they considered either direct (query) keyword match or object frequency from automatic annotations contained in Google’s FACC1 annotation set. For selection of expansion terms they considered a pseudo-relevance feedback approach (which is somewhat comparable, in spirit, to our analysis of relevance feedback mechanisms—Choice 5) and a supervised classification approach (SVM).

Liu and Fang (2015) developed a method for entity-based retrieval that represents entities in a latent space and computes retrieval scores by mapping document and query entities to this common latent space and comparing their projections. Their approach is an alternative to the EQFE method used in this article—a comparison between the latent entity space of Liu and Fang and EQFE in CHS settings is out of the scope of this article; however, we intend to direct future work towards this comparison.

The query expansion technique we considered in this work, EQFE, applies entity extraction and analysis to the query expansion stage of the retrieval process. Other techniques, instead, use entities throughout the different stages of retrieval (i.e., in both indexing and retrieval). This is the case, for example, of the concept-based IR model Explicit Semantic Analysis (Egozi et al. 2011), which relied on entities represented in Wikipedia to identify suitable indexing and retrieval features. A similar approach to concept/entity-based IR has been followed by methods in the medical domain. For example, Zuccon et al. (2012) used the SNOMED-CT terminology to represent medical entities at indexing and retrieval time. Their method further exploited subsumption (i.e., parent-child) relationships between entities to derive query expansion terms. Koopman et al. (2012), instead, used co-occurrence graphs between entities in the same document for retrieval, also relying on an entity-based indexing and retrieval mechanism. The downside of these methods is that entity indexing is often computationally demanding (e.g., entity extraction and annotation must be run across all documents in the corpus) and thus difficult to scale to large web corpora (such as those used in this article).

2.2 Consumer health search (CHS)

One of the major challenges in CHS is the vocabulary mismatch between people’s query terms and the terms used in high quality health web resources. One source of high quality health related terms is the Unified Medical Language System (UMLS) (Bodenreider 2004). However, UMLS concepts are rarely mentioned in consumer health queries: Keselman et al. (2008) showed that only 8.1% of 4,928,158 n-grams from consumer queries can be exactly matched to UMLS concepts. In this section, we discuss work related to knowledge-base retrieval for CHS.

In contrast, Wikipedia is a crowdsourced, general purpose KB allowing people to promote and describe new concepts or augment existing concepts. While general purpose, Wikipedia contains considerable and detailed health information that has been effectively used in health related information retrieval (Jimmy et al. 2018; Soldaini et al. 2015).

In an earlier study, we evaluated several design choices to instantiate the EQFE model in CHS (Jimmy et al. 2018). These were:

  1. Collect pages with medicine infoboxFootnote 4 typeFootnote 5 (e.g., “abortion method”, “alternative medicine”, “pandemic”);

  2. Collect pages with health infobox type or with links to medical terminologies such as UMLS, Disease DB and ICD in the health infobox;

  3. Collect pages that had at least one UMLS entity mention in their title. Entity extraction was done using QuickUMLS (Soldaini and Goharian 2016).

Previously, Soldaini et al. (2015) utilised Wikipedia to select health related terms from clinical case reports. First, they built a health related Wikipedia KB by collecting pages that contained an infobox with links to medical terminologies, and a non-health related Wikipedia KB containing the remaining pages. Then, they calculated the probability of a term being health related by computing the ratio between the probability of the term being found in the health KB and that of the term being found in the non-health KB. We employed a similar method to limit the terms selected by relevance feedback (RF) processes (either explicit or pseudo RF) (see Sect. 3.2.5).

The probability of a term being health related has also been shown to be an effective criterion for selecting expansion terms for CHS (Soldaini et al. 2016). In that work, medical synonyms were extracted by mapping query terms to three medical KBs (Behavioral, MedSyn, or DBpedia). Then, the synonym with the highest probability of being health related was added to the original query. Finally, a supervised classifier was used to select the most likely synonym for each query. In our study, we further explored features of KBs (beyond synonyms) to improve the effectiveness of CHS queries.

In contrast with Wikipedia, the UMLS is a medical specific knowledge base that contains medical concepts and relationships among concepts (Bodenreider 2004). Its latest 2017 version (i.e., 2017AB) contains approximately 3.64 million concepts compiled from 201 biomedical vocabularies in various languages. Each UMLS concept is grouped into one or more semantic types (out of 133 semantic types in total). As the UMLS is compiled from biomedical vocabularies, it contains many semantic types that are not relevant to CHS, such as amino acid sequence, cell function, and embryonic structure. For this reason, Soldaini et al. (2016) and Limsopatham et al. (2013) decided to include only concepts from 16 semantic types considered related to the four aspects of medical decision criteria: symptom, diagnostic test, diagnosis, and treatment. In our experiments using the UMLS, we followed the same practice.

Using UMLS for CHS still results in vocabulary mismatch between people’s queries and the medical terms in the UMLS (Keselman et al. 2008). To overcome this, the Consumer Health Vocabulary (CHV) (Zeng and Tse 2006; Keselman et al. 2008) was built; this open access resource provides a mapping between consumer health terms and UMLS concepts.

This mapping is constructed by extracting n-grams from MedlinePlus queries and various health-focused bulletin boards; these n-grams are then automatically mapped to the UMLS via exact match comparison. Any un-mapped n-grams are then manually mapped to the UMLS (Keselman et al. 2008). Since 2007, the CHV has been available as part of the UMLS, as entries with “CHV” as their source (i.e., SAB).

Both UMLS and Wikipedia have been used as learning-to-rank (LtR) features for CHS (Soldaini and Goharian 2017). The results showed that, for Wikipedia, the average idf and the tf in health pages were the first and third best LtR features, respectively. For UMLS, the number of matching UMLS concepts in a document, the number of “sign or symptom” concepts found in a document, and the number of “injury or poisoning” concepts found in a document were the second, fifth, and seventh best LtR features, respectively. The best LtR system from Soldaini and Goharian (2017) beat a baseline system by 26.6% on the CLEF2016 dataset (nDCG@10: 0.305 vs nDCG@10: 0.241). This is the same dataset used in this article; thus, we used the results of their study as a benchmark.

In this study, we posit that Wikipedia, UMLS, and CHV have the potential to improve consumer health search. We evaluated the effectiveness of various CHS design choices using these three KBs.

3 Methodology

3.1 Expansion model

Fig. 1 Summary of expansion sources

We implemented the Entity Query Feature Expansion (EQFE) model for retrieval using Wikipedia, UMLS, and CHV as KBs. The EQFE model aims to enrich a query with features from KB entities that are linked to the query. For the Wikipedia KB, a single entity is represented by a single Wikipedia page (the page title identifies the entity). Beyond the title, Wikipedia pages contain many features useful in a retrieval scenario: entity title (E), categories (C), links (L), aliases (A), and body (B). As for the UMLS and CHV KBs, a single entity is represented by the most frequently used term for a single concept unique identifier (CUI). Features of a UMLS or CHV entity are aliases (A), body (B), parent concepts (P), and related concepts (R). Figure 1 shows the features we used for mapping the queries to entities in the KB and as the source of expansion terms. We formally define the query expansion model as:

$$\begin{aligned} \hat{\vartheta }_q = \sum _{M}^{} \sum _{f}^{} \lambda _f \vartheta _{f(EM, SE)} \end{aligned}$$
(1)

where M is the set of entity mentions, containing the uni-, bi-, and tri-grams generated from the query; f is a function used to extract the expansion terms; \(\lambda _f \in (0,1)\) is a weighting factor; and \(\vartheta _{f(EM, SE)}\) is a function that maps an entity mention to the KB features EM (e.g., “Title”, “Aliases”, “Links”, “Body”) and extracts expansion terms from the source of expansion SE (e.g., “Title”, “Aliases”).
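To make the model concrete, the following minimal Python sketch shows one way Eq. 1 can be instantiated. The KB access interface (kb.lookup) and the fixed weight are illustrative assumptions, not the exact implementation used in our experiments:

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Generate the uni-, bi- and tri-gram entity mentions M from a query."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def expand_query(query, kb, mapping_feature="aliases",
                 expansion_feature="title", weight=0.5):
    """Sketch of Eq. 1: accumulate expansion-term weights over mentions and features.

    kb.lookup(mention, feature) is a hypothetical helper returning the KB
    entities whose `feature` exactly matches the mention; entity[expansion_feature]
    returns the text stored under that feature (e.g. the entity title).
    """
    theta = Counter()                      # expanded query representation
    for mention in ngrams(query.lower().split()):
        for entity in kb.lookup(mention, mapping_feature):
            for term in entity[expansion_feature].split():
                theta[term] += weight      # plays the role of lambda_f in Eq. 1
    for term in query.lower().split():     # keep the original query terms
        theta[term] += 1.0
    return theta
```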

3.2 Choices in knowledge base retrieval

This section describes the choices that we considered for each component of the EQFE pipeline (Fig. 2). To select the expansion terms, first, we constructed a number of knowledge bases (KBs); each KB contains features such as title, aliases, etc. Second, we extracted entities from the original queries. Third, we mapped the query entities to entities in each KB by exact matching each query entity against each KB feature. Fourth, we sourced expansion terms from the features of the mapped KB entities. Fifth, we performed relevance feedback with the aim of further improving the already expanded queries. The remainder of this section describes these choices in detail.

Fig. 2 The EQFE pipeline we considered in this article when instantiating this model. In this model, q is the original query, q’ is an expanded query, Exp denotes the expansion terms, and q” is a query expanded with (pseudo-) relevance feedback (p(rf)), after the original query was augmented using query expansion

3.2.1 Choice 1: knowledge base construction

We investigated which entities should form the basis of our KB. The CHS focus meant that health-related entities were needed. For the Wikipedia KB, we considered four Wikipedia Construction (WC) choices for collecting health related pages:

WC-All: all Wikipedia pages;

WC-Type: pages with Medicine infoboxFootnote 6 typeFootnote 7 (e.g., “abortion method”, “alternative medicine”, “pandemic”);

WC-TypeLinks: pages with Medicine infobox type and pages with infobox containing links to medical terminologies such as MeSH, UMLS, SNOMED CT, ICD;

WC-UMLS: pages with title matching an UMLS entity.

The last method used QuickUMLS (Soldaini and Goharian 2016) to map Wikipedia page titles to the UMLS: if the mapping was successful, we included the Wikipedia entity (page) in the KB.
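As an illustration, the WC-UMLS filter can be approximated with the QuickUMLS Python package roughly as follows; the installation path and matching threshold shown are assumptions that depend on the local QuickUMLS index:

```python
from quickumls import QuickUMLS

# Path to a local QuickUMLS installation (an assumption of this sketch);
# the matching threshold is also illustrative.
matcher = QuickUMLS("/path/to/quickumls/index", threshold=0.9)

def title_maps_to_umls(page_title):
    """WC-UMLS: keep a Wikipedia page only if its title maps to a UMLS concept."""
    return len(matcher.match(page_title, best_match=True)) > 0

def build_wc_umls(pages):
    """Filter (title, page) pairs down to the WC-UMLS knowledge base."""
    return [(title, page) for title, page in pages if title_maps_to_umls(title)]
```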

For UMLS and CHV KBs, we considered the following UMLS Construction (UC) and CHV Construction (CC) choices:

UC/CC-All: all entities;

UC/CC-Med: entities related to four key aspects of medical decision criteria (i.e., symptoms, diagnostic test, diagnoses, and treatments) as used in (Limsopatham et al. 2013; Soldaini et al. 2016).

For these choices, we included all English and non-obsolete terms.

Fig. 3 Extracting entity mentions from the query “natural cures for lifelong insomnia”: the influence of different choices for entity extraction (Choice 2)

3.2.2 Choice 2: entity mention extraction

Entity mention extraction is the process of identifying spans of text in the query that could map to some entity; it does not determine which exact entity a span maps to (this is detailed in the next section). We considered four possible Mention Extraction (ME) choices to extract entity mentions (see Fig. 3):

ME-All: include all uni-, bi- and tri-grams of the query (default choice);

ME-CHV: include only those uni-, bi- and tri-grams of the query that matched entities in the Consumer Health Vocabulary (CHV) (Keselman et al. 2006);Footnote 8

ME-UMLS: include only those uni-, bi- and tri-grams of the query that matched entities in the UMLS (via QuickUMLS);

ME-MetaMap: include only those uni-, bi- and tri-grams of the query that matched health entities via MetaMap (Aronson and Lang 2010).

These choices were used for all KBs. For ME-CHV, we used the CHV version included in the UMLS version 2017AB [while in our previous work we used CHV version 20110204 (Jimmy et al. 2018)].
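A minimal sketch of the n-gram based mention extraction choices follows, assuming a pre-built set of surface forms (CHV or UMLS) is available for the filtered variants; ME-MetaMap, which calls the external MetaMap system, is omitted:

```python
def query_ngrams(query, n_max=3):
    """All uni-, bi- and tri-grams of the query (ME-All)."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def extract_mentions(query, vocabulary=None):
    """ME-All when vocabulary is None; ME-CHV / ME-UMLS when vocabulary is the
    (assumed pre-built) set of surface forms of the corresponding terminology."""
    mentions = query_ngrams(query)
    if vocabulary is None:
        return mentions
    return [m for m in mentions if m in vocabulary]

# Toy example for the query of Fig. 3; the vocabulary here is illustrative only.
print(extract_mentions("natural cures for lifelong insomnia",
                       vocabulary={"insomnia", "natural"}))
```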

3.2.3 Choice 3: entity mapping

We investigated how the entity mentions from the previous section were mapped to entities in the KB. An entity mention was mapped to an entity if an exact match was found between the mention and the entity. As shown in Fig. 1, the Wikipedia entity can be represented according to five different features. The Wikipedia Entity Mapping (WEM) choices considered were:

WEM-Title: titles;

WEM-Aliases: aliases;

WEM-Links: links;

WEM-Body: the entire bodies of the Wikipedia pages;

WEM-Cat: categories;

WEM-All: all the previous sources (default choice).

For UMLS and CHV KBs, the UMLS Entity Mapping (UEM) and CHV Entity Mapping (CEM) choices considered were:

UEM/CEM-Title: titles;

UEM/CEM-Aliases: aliases;

UEM/CEM-Body: the entire UMLS concept description;

UEM/CEM-Parent: parents;

UEM/CEM-Related: related entities;

UEM/CEM-All: all the previous sources (default choice);

UEM/CEM-QuickUmls: use QuickUMLS to obtain entity mappings.

Table 1 shows the mappings to the Aliases feature of each KB for the query “abdominal pain, vomiting, pain near belly button, duplicated ureter”.
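The exact-match mapping itself can be sketched as follows, assuming each KB entity is stored as a dictionary of feature values; selecting all feature names emulates the WEM-All/UEM-All/CEM-All variants:

```python
def map_mention(mention, kb_entities, features=("aliases",)):
    """Choice 3 sketch: return KB entities whose selected feature(s) exactly
    match the mention.

    kb_entities is an assumed iterable of dicts such as
    {"title": "...", "aliases": ["...", ...], "body": "...", ...}.
    """
    mention = mention.lower()
    matched = []
    for entity in kb_entities:
        for feature in features:
            values = entity.get(feature, [])
            if isinstance(values, str):
                values = [values]
            if any(mention == value.lower() for value in values):
                matched.append(entity)
                break   # one matching feature is enough for this entity
    return matched
```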

Table 1 Choice 3: Mapped entities for query id 122006: “abdominal pain, vomiting, pain near belly button, duplicated ureter” are mapped to the Aliases feature of each KB
Table 2 Choice 4: Expansion terms selected for each KB when considering different variants for the choice source of expansion. For this example, the initial query was id 103004: “headaches caused by too much blood or “high blood pressure””

3.2.4 Choice 4: source of expansion

We investigated which sources in the KB were used to draw candidate terms for query expansion. We explored three Source of Expansion (SE) choices:

SE-Title: titles associated with the entities;

SE-Aliases: aliases associated with the entities;

SE-All: both titles and aliases (default choice).

While other information sources could be used (for example, those used for entity mapping), preliminary experiments showed that only these three choices produced meaningful results. These choices were used for all KBs (Wikipedia, UMLS, and CHV). An example of the different outputs obtained by each variant for this choice is shown in Table 2.

3.2.5 Choice 5: relevance feedback

The unique challenges of CHS make explicit relevance feedback (RF, i.e., where feedback comes from the user) a worthwhile consideration for improving retrieval effectiveness. The question that follows is: what gains are possible if the user provided explicit feedback? To answer this, we applied RF by using the actual relevance labels (qrels) to simulate an accurate user selecting relevant documents. Comparison was made to a non-RF baseline to determine the effective gain from explicit RF. In this study, we investigated the use of relevance feedback (both explicit relevance feedback (RF) and pseudo relevance feedback (PRF)) as used in Jimmy et al. (2018).

We performed RF by extracting the 10 most important health related words (based on tf.idf scores) from each of the top three relevant documents (relevance label greater than 0) thus resulting in a maximum of thirty expansion terms. PRF was performed by extracting the 10 most important health related words from the top three ranked documents (regardless of their true relevance label). A term was considered as health related if it exactly matched a title or an alias of an entity in the target KB: either Wikipedia (WC-TypeLinks) or UMLS (UC-All).
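A minimal sketch of this RF/PRF term extraction, assuming precomputed tf.idf scores and a health-relatedness test against the target KB’s titles and aliases, is:

```python
def feedback_terms(ranked_docs, qrels=None, k_docs=3, k_terms=10):
    """Sketch of the RF / PRF expansion described above.

    ranked_docs is the initial ranking; with qrels (label > 0) this simulates
    explicit RF, without qrels it is PRF. tfidf(term, doc), doc.terms and
    is_health_related(term) (exact match against the target KB's titles and
    aliases) are assumed helpers.
    """
    if qrels is not None:
        docs = [d for d in ranked_docs if qrels.get(d.id, 0) > 0][:k_docs]
    else:
        docs = ranked_docs[:k_docs]
    expansion = []
    for doc in docs:
        ranked_terms = sorted(doc.terms, key=lambda t: tfidf(t, doc), reverse=True)
        health_terms = [t for t in ranked_terms if is_health_related(t)]
        expansion.extend(health_terms[:k_terms])   # at most 10 terms per document
    return expansion
```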

In addition, in this study we also considered the relevance feedback approach proposed by Soldaini et al. (2015). We refer to this approach as RF Health Terms (RFHT) and PRF Health Terms (PRFHT), as it filters the candidate relevance feedback terms based on the probability of a term being health related, computed from likelihoods derived from Wikipedia (see Sect. 2.2).

In PRFHT, all terms in the top k results with high probability of being health-related are extracted and used for query expansion. This probability is calculated as:

$$\begin{aligned} OR(t_j) = \frac{Pr\{P\ {\textit{is}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}}{Pr\{P\ {\textit{is}}\ {\textit{not}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}} \end{aligned}$$
(2)

where P is a Wikipedia page and term \(t_j\) is included in a query if \(OR(t_j) \ge \delta \). In our experiments, we calculated the probabilities of a Wikipedia page P being health related and being not-health related as:

$$\begin{aligned} Pr\{P\ {\textit{is}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}= \frac{|P \in D_h : t_j \in P|}{|D_h|} \end{aligned}$$
(3)
$$\begin{aligned} Pr\{P\ {\textit{is}}\ {\textit{not}}\ {\textit{health}}\ {\textit{related}}\ |\ t_j \in P\}= \frac{|P \in D_{nh} : t_j \in P|}{|D_{nh}|} \end{aligned}$$
(4)

where \(D_h\) is a collection of Wikipedia pages with health infobox and links to medical terminologies (i.e., WC-TypeLinks) and \(D_{nh}\) contains Wikipedia pages that are not included in \(D_h\). Using the English subset of Wikipedia crawled on the 1/12/2016, we found that \(|D_h| = 13{,}135\) and \(|D_{nh}| = 9{,}182{,}304\).

While Soldaini et al. (2015) suggested that the optimal value for \(\delta \) is 2, in preliminary experiments we found that \(\delta = 2\) is too low, as many non-health terms obtained \(OR(t_j) \ge 2\); in this study, instead, we used \(\delta = 4\) as it was a better fit. This difference was likely due to a different Wikipedia dump being used: ours was substantially larger than that reported by Soldaini et al. Further, to prevent query drift, we limited the number of expansion terms added for PRFHT to 20.
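A sketch of this health-term filter (Eqs. 2–4), assuming the document frequencies over the health and non-health Wikipedia subsets have been precomputed, is:

```python
D_H, D_NH = 13_135, 9_182_304   # |D_h| and |D_nh| from our Wikipedia dump

def odds_ratio(term, df_health, df_non_health):
    """OR(t_j) of Eqs. 2-4; df_* are assumed dicts: term -> document frequency."""
    p_h = df_health.get(term, 0) / D_H
    p_nh = df_non_health.get(term, 0) / D_NH
    if p_nh == 0:
        return float("inf") if p_h > 0 else 0.0
    return p_h / p_nh

def prfht_filter(candidates, df_health, df_non_health, delta=4.0, max_terms=20):
    """Keep candidate feedback terms with OR >= delta, capped to limit query drift."""
    kept = [t for t in candidates
            if odds_ratio(t, df_health, df_non_health) >= delta]
    return kept[:max_terms]
```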

Once terms are filtered to retain only terms estimated to be health related, the j-th health term in document \(D_i\) is weighted according to:

$$\begin{aligned} b_j = \log _{10}(10 + w_j) \end{aligned}$$
(5)

where:

$$\begin{aligned} w_j = \alpha \cdot I_q(t_j) \cdot tf_j + \bigg (\frac{\beta }{k}\bigg ) \cdot \sum _{i=1}^{k} I_{D_i} (t_j) \cdot idf_j \end{aligned}$$
(6)

Following the work of Soldaini et al. (2015), we fixed \(k = 10\), \(\alpha = 2.0 \) and \(\beta = 0.75\). In Eq. 6, \(I_q(t_j) = 1\) if \(t_j \in Q\), and 0 otherwise; \(I_{D_i} (t_j) = 1\) if \(t_j \in D_i\), and 0 otherwise.
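The term weighting of Eqs. 5–6 can be sketched as follows, using the fixed parameter values above; the term statistics and the document/query membership tests are assumed inputs:

```python
import math

def prfht_weight(term, query_terms, top_docs, tf, idf,
                 k=10, alpha=2.0, beta=0.75):
    """b_j of Eq. 5 for one candidate health term (w_j is Eq. 6).

    tf and idf are assumed dicts of term statistics; top_docs is the list of
    the top k feedback documents, each supporting `term in doc`.
    """
    i_q = 1 if term in query_terms else 0
    doc_part = sum(idf[term] for doc in top_docs[:k] if term in doc)
    w = alpha * i_q * tf[term] + (beta / k) * doc_part
    return math.log10(10 + w)
```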

For the explicit relevance feedback variant (RFHT), we modified the above PRFHT approach to only extract terms from the top k explicitly relevant documents. Unlike PRFHT, for RFHT we did not limit the number of expansion terms added: all expansion terms with \(OR(t_j) \ge 4\) were added to the original query.

4 Data collection

To investigate the influence that choices in KB retrieval have on query expansion for the CHS task, we empirically evaluated methods using the CLEF 2016 eHealth collection (Zuccon et al. 2016). This collection comprises 300 query topics originating from health consumers seeking health advice online. Documents are taken from Clueweb12b-13. The collection was indexed using Elasticsearch 5.1.1, with stopping and stemming. A simple baseline was implemented using BM25F with \(b=0.75\) and \(k1=1.2\). BM25F allows specifying boosting factors for matches occurring in different fields of the indexed web page. We considered only the title field and the body field, with boost factors 1 and 3, respectively. These were found to be the optimal weights for BM25F for this test collection in previous work (Jimmy et al. 2016). This is a strong baseline as it outperforms most runs submitted to CLEF 2016.
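For reference, a field-boosted query against such an index can be expressed roughly as follows; the index name and the use of a multi_match query are assumptions of this sketch, and the BM25 parameters (b, k1) would be set in the index similarity settings rather than in the query itself:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local Elasticsearch 5.x instance

def baseline_search(query, index="clueweb12b13", size=10):
    """Field-boosted retrieval: title with boost 1 and body with boost 3."""
    body = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^1", "body^3"],
            }
        },
        "size": size,
    }
    return es.search(index=index, body=body)
```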

For constructing the Wikipedia KB, we considered candidate pages from the English subset of Wikipedia (dump 1/12/2016), limited to current revisions only and without talk or user pages. Of the 17 million entries, we filtered out pages that were redirects; this resulted in a Wikipedia corpus of 9,195,439 pages (i.e., WC-All). These candidate pages were then processed according to the choices available for KB construction (Sect. 3.2.1). The total number of pages included in WC-Type is 9562 pages, in WC-TypeLinks is 13,135 pages, and in WC-UMLS is 1,112,206 pages. Selected pages to be included in the KB were also indexed using Elasticsearch 5.1.1 with field based indexing, to support the use of different fields as the source of query expansion terms (Sect. 3.2.4). For all Wikipedia KBs, we indexed the following fields: title (text node of element node <title>), links (outbound links to other Wikipedia pages), categories (as defined in [[Category:category name]]), types (types of all infoboxes in a page), aliases (text node of element node <title> from the page’s redirects), and body (text node of element node <text>).

For constructing the UMLS KB, we indexed non-obsolete English terms (i.e., UC-All) with the following fields: title (the most frequently used term for a CUI), aliases (all other terms used for the CUI), body (the description of a CUI), parent (title of UMLS entities with relationship type PAR), and related (title of UMLS entities with relationship type RQ and RL). Similar to the Wikipedia KB, we processed these UMLS terms according to the choices for constructing the UMLS KB described in Sect. 3.2.1, obtaining 3,057,234 terms in UC-All and 1,344,941 terms in UC-Med.

The CHV KB was constructed by selecting UMLS KB entries with the UMLS SAB field equal to “CHV”. The CHV KB index structure was identical to the UMLS KB. For the CHV based KB, we obtained 56,350 terms in CC-All and 34,514 terms in CC-Med.

5 Empirical evaluation

Results were evaluated using nDCG@10 and RBP@10 (persistence 0.5, depth 10, also reporting residuals (Res.)), in line with the CLEF 2016 collection, as users in the CHS task tend to primarily examine the first few search results. Additionally, bpref was used as a first attempt to reduce the influence of unjudged documents on evaluation (expanded queries retrieved many more unjudged documents than the baseline). For brevity, a full account of statistically significant differences (pairwise t-test with Bonferroni adjustment and \(\alpha <0.05\)) between results is reported in “Appendix 1: Statistical significance analysis” section. Furthermore, the average number of terms added to the expanded query (\(\overline{|exp|}\)), along with the number of expanded queries, of queries with an RBP@10 gain, and of queries with an RBP@10 loss, were recorded as a triplet \({<}e,g,l{>}\).
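For reference, RBP@10 and its residual can be computed as in the following minimal sketch, where unjudged documents (and ranks below the evaluation depth) contribute to the residual rather than to the score:

```python
def rbp_at_10(ranking, judgments, p=0.5, depth=10):
    """Rank-biased precision with persistence p, plus its residual.

    ranking is a list of document ids; judgments maps id -> relevance label
    (label > 0 counts as relevant). Ids missing from judgments are unjudged
    and contribute to the residual, as does everything below the depth cut-off.
    """
    score = 0.0
    residual = p ** depth              # uncertainty beyond the cut-off
    for rank, doc_id in enumerate(ranking[:depth]):
        weight = (1 - p) * p ** rank
        if doc_id not in judgments:
            residual += weight         # an unjudged document could be relevant
        elif judgments[doc_id] > 0:
            score += weight
    return score, residual
```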

We empirically evaluated the influence each choice had on retrieval effectiveness by examining the choices sequentially. We did this for all KBs, and drew conclusions about which KB best supports CHS at the end. For each choice, we fixed the best setting and used this best setting for the subsequent choice. We determined the best setting firstly based on results (i.e., nDCG@10, bpref, RBP@10) for the all queries set. If no method was clearly best for this set, then we checked results from the high coverage queries set. Lastly, if results from the high coverage queries set were unable to clearly determine which method was best, then we selected the setting with the highest RBP@10 for the all queries set as the best setting (RBP@10 was a primary measure for CLEF 2016). The complete set of results is provided in an online appendix at http://ielab.io/kb-chs, along with all runs and the software source code used.

5.1 Choice 1: knowledge base construction

The effect on retrieval of choices in KB construction is reported in Table 3 (top); results are averaged over all 300 queries in the CLEF 2016 collection.

Table 3 Influence of choices in KB construction for CLEF2016 (Choice 1). Statistical significance differences reported in Table 16
Fig. 4 Unjudged documents among the top 10 retrieved by runs in Table 3 (top)

The results for the Wikipedia KB showed that choice WC-TypeLinks (i.e., pages with health infobox type and links to health terms) led to the highest effectiveness across all measures. For the UMLS KB, UC-All achieved the highest effectiveness on all measures. Lastly, for the CHV KB, CC-Med performed best across all measures. Nevertheless, the baseline performed considerably better than any KB retrieval method.

When further analysing the results, we found that, for a large number of queries, the KB retrieval methods ranked many unjudged documents amongst the top 10, while the baseline had a much lower rate of unjudged documents amongst the top 10. Figure 4 reports the distribution of unjudged documents for each of the configurations considered. This is clearly influencing the results, as demonstrated by the large RBP residuals associated with the KB retrieval methods in Table 3 (top) (compared to the residual of the baseline). Interestingly, if all unjudged documents turned out to be relevant, the RBP@10 of the KB retrieval methods would prove largely superior to that of the baseline (compare the residuals).

We then considered a subset of queries for which, on average across all runs considered for a specific choice, there were at most 2 unjudged documents out of the first 10. This threshold was determined by analysing the number of unjudged documents for the baseline (the baseline does not change, irrespective of the choices), so that the threshold corresponded to 1.5 times the interquartile range above the third quartile (the upper whisker of the box-plot). Note that this produced a different subset of queries for each of the considered choices; however, the subsets had the same average “coverage” with respect to the relevance assessments. We refer to these subsets as the high coverage queries set, and to the set containing all the queries as the all queries set. This subset included 12 queries for Choice 1 (Table 3, bottom). Results showed reduced residuals and reduced gaps between KB retrieval methods and the baseline; this affected trends in effectiveness across the considered choices for the Wikipedia KB.

Results from the Wikipedia KB showed that, for the all queries set, the WC-TypeLinks setting performed best on all three measures. Therefore, although the high coverage queries set showed a different trend, we decided that constructing the Wikipedia KB using the WC-TypeLinks setting was the best option.

Trends in effectiveness for the UMLS KB showed that UC-All consistently performed best in both the all queries set and the high coverage queries set. Therefore, we selected UC-All for the following analyses. Lastly, for the CHV KB, we found that CC-Med performed best for all queries on all three measures. Thus, we selected CC-Med as the best setting for the CHV KB.

Interestingly, the KB constructed with the UC-All choice (which contains many concepts unrelated to the health domain, such as C0030561: Paris, France) performed better than the one constructed with the UC-Med choice (which intuitively would contain more health concepts). As noted in Sect. 4, however, the number of concepts in UC-Med is less than half that of UC-All. It is likely that there exists a better way to filter out non-health related concepts from the UMLS. Based on this, an avenue for future work is the development of an effective method for selecting the subset of the UMLS relevant to CHS queries (i.e., improving the construction of the KB based on the UC-Med setting).

Table 4 Influence of choices in entity mention extraction (Choice 2). Statistical significance differences reported in Table 17

5.2 Choice 2: entity mention extraction

Table 4 (top: 300 queries and bottom: 19 high coverage queries) reports the results obtained when comparing choices for entity mention extraction. For the Wikipedia KB, results from the all queries set (Table 4, top) showed no choice was clearly best. We then looked at the high coverage queries set, where the WME-CHV setting performed best on all measures. Therefore, we selected WME-CHV as the best setting for the Wikipedia KB and used this setting in the following analyses.

For UMLS KB, we found that UME-UMLS performed best for the all queries set for all three measures. Thus, we selected UME-UMLS as the best setting for UMLS KB.

Lastly, for CHV KB, both the all queries set and the high coverage queries set showed no choice was clearly best. Therefore, we selected CME-CHV as the setting for CHV KB as it performed best for RBP@10 in the all queries set.

Table 5 Influence of choices in entity mapping (Choice 3). Statistical significance differences reported in Tables 18 and 19

5.3 Choice 3: entity mapping

Table 5 (top: 300 queries and bottom: 18 high coverage queries) reports the results obtained when comparing choices for entity mapping. For all KBs, mapping entities to Aliases (WEM-Aliases, UEM-Aliases, and CEM-Aliases) clearly outperformed the other approaches on the all queries set. Results for the high coverage queries were mixed. Thus, we selected WEM-Aliases, UEM-Aliases, and CEM-Aliases for the subsequent analyses.

5.4 Choice 4: source of expansion

Table 6 (top: 300 queries and bottom: 129 high coverage queries) reports the results obtained when comparing sources of query expansion. Results clearly showed that selecting titles as the source of expansion (WSE-Title, USE-Title and CSE-Title) was the most effective choice for both the Wikipedia and UMLS KBs. Therefore, we selected WSE-Title, USE-Title, and CSE-Title as the best settings for each corresponding KB.

Then, we investigated the merit of combining expansion terms from the best setting of each KB; e.g., expansion terms for the WikiChv were generated by combining expansion terms from the WSE-Title and CSE-Title settings. In total, we generated four possible combinations: WikiUmlsChv, WikiUmls, WikiChv, and UmlsChv. Results for both the all queries set and the high coverage queries set showed that no choice was clearly best. We then selected WikiChv as the best setting as it returned the highest RBP@10 for the all queries set.

Table 6 Influence of choices in source of expansion (Choice 4). Statistical significance differences reported in Table 20

5.5 Choice 5: relevance feedback

Table 7 (top: 300 queries and bottom: 76 high coverage queries) reports the results obtained with and without relevance feedback. For the all queries set, results for all KBs showed that the addition of relevance feedback filtered based on the likelihood of being health related (RFHT) performed best across all measures. In contrast, the addition of pseudo relevance feedback hurt performance for all KBs (with the exception of baselinePRFHT and CSE-TitlePRFHT, which had a better bpref than the baseline and CSE-Title without pseudo relevance feedback).

Results from the high coverage queries set showed similar patterns, where applying RFHT performed best on all measures. The best settings of all KBs with RFHT performed better across all measures compared to the baseline with RFHT.

Table 7 Influence of choices in relevance feedback (Choice 5). Statistical significance differences reported in Table 21 and  22

6 Analysis and discussion

In summary, from Table 7, we highlight the following observations:

  • PRF harmed effectiveness, independent of the KB and of the PRF approach used (including the PRFHT method). While both PRF and PRFHT selected only the top ranked health terms, not all health terms in the top ranked documents were related to the query. For example, the results retrieved by query “lay down cough” (query number 104003) contained many terms related to “coughing”, such as “flu”. While “cough” might relate to flu, pages discussing flu may not necessarily be relevant to the original query. Hence, we found that performing PRF(HT) on expanded queries resulted in query drift, and generated results with higher residuals compared to methods without PRF(HT). Nevertheless, after residuals were reduced through the use of condensed lists (judged documents only, see Sect. 6.2.2 for the results), queries with PRF(HT) generally performed better than without PRF(HT).

  • RF, instead, did provide improved effectiveness, independently of the RF approach, the KB used or the query set (high coverage or all queries).

  • Both PRFHT and RFHT, which used the likelihood of expansion terms to be health related, performed generally better for all measures compared to simple PRF or RF.

  • When using the all queries set and no relevance feedback, the combination of expansion terms from both Wikipedia and CHV (WikiChv) performed best (on all measures). The only exception was the baseline’s nDCG@10 score, which was higher. This was likely because the results obtained with WikiChv contained a higher number of unjudged documents compared to the baseline. This highlights that combining expansion terms from multiple KBs did improve the original CHS queries.

  • For the high coverage queries set, expanded queries with no relevance feedback performed better than the baseline for all measures (see Table 6 (bottom)). This suggests that each KB could be used to effectively expand CHS queries. Overall, the best settings from CHV (CSE-Title) outperformed the best settings from the other KBs.

  • For the high coverage queries, independently of relevance feedback, the best setting for all KBs generated a higher number of queries with an effectiveness gain than with a loss (see Table 7 (bottom)). In fact, in these cases the gains (losses) are WSE-Title: 52.38% (38.10%), USE-Title: 47.54% (22.95%), CSE-Title: 58.33% (27.78%), and WikiChv: 54.76% (33.33%). When relevance feedback is considered (and in particular, the best feedback technique is used, i.e. RFHT), the gains (losses) become: WSE-TitleRFHT: 68.42% (22.37%), USE-TitleRFHT: 69.74% (21.05%), CSE-TitleRFHT: 68.42% (23.68%), and WikiChvRFHT: 67.11% (23.68%).

To contextualise the results obtained by the KB retrieval methods, in Table 7, we also reported the results of the method implemented by the GUIR-3 submission to the CLEF 2016 challenge (Soldaini et al. 2016). This was the best performing, comparableFootnote 9 query expansion method at CLEF 2016. The method expands queries by mapping query entities to the UMLS, then navigating the UMLS tree to gather hypernyms from mapped entities as the source of expansion. Post-processing is applied to prune entities unlikely to benefit retrieval. For each query, multiple expanded query variations are collected and their results aggregated using the Borda algorithm (see Soldaini et al. (2016) for details). Unlike the original method, our implementation relied on BM25F rather than DFR as the scoring method and QuickUMLS in place of MetaMap as the entity extraction method, so as to be directly comparable with our baseline and KB retrieval methods. In Table 7, we do not report \(\overline{|exp|}\) for GUIR-3 as the method replaces some of the original terms with the expansions, thus making comparisons non-trivial.

While Jimmy et al. (2018) suggested that shorter expansions are likely to be more effective, in this study we found that this is not necessarily true. Table 7 shows that the combination of the Wikipedia and CHV based KBs (WikiChv) added more expansion terms on average and performed better than the best settings from either the Wikipedia or the CHV based KB. Furthermore, Table 7 also shows that PRFHT and RFHT generate significantly more expansion terms and yet are more effective than the PRF and RF approaches.

Fig. 5 Changes in RBP@10 between the Entity Query Feature Expansion model utilising the best settings versus the baseline. Only high coverage queries are reported

Overall results can hide some underlying trends, so we also analysed the impact of query expansion on a per-query basis. Figure 5 shows the gains/losses versus the baseline obtained by the best settings of the Wikipedia KB (WSE-TitleRFHT), UMLS KB (USE-TitleRFHT), CHV KB (CSE-TitleRFHT), and the combination of the Wikipedia and CHV KBs (WikiChvRFHT). The magnitudes of these changes are shown in the figure. These improvements (or losses) were measured using RBP@10, and thus expanded queries with low coverage are unlikely to perform as effectively as expanded queries with high coverage. Gains and losses were similar for the different KBs; i.e., for a given query, the gain or loss was similar irrespective of the KB. Only 5 out of the 76 high coverage queries did not exhibit this trend.

Table 8 Performance gain/loss from expanded queries where RBP@10 gains were found in one or more KB, but losses were found in the other KBs

Next, we investigated the queries that were expanded with terms from all KBs without relevance feedback (WSE-Title, USE-Title, CSE-Title, and WikiChv). To do so, we analysed results for the high coverage queries in Choice 4 (Table 6 (bottom)) and found that, of the 129 high coverage queries, 12 queries were expanded by all of the four best settings (see Table 9). The small number of overlapping expanded queries from the four best settings suggests that each best setting mostly targeted different queries. Table 9 shows similar patterns to Table 8, where gains and losses were similar for the different KBs.

Table 9 Performance gain/loss from high coverage queries in Table 6 (bottom). Only queries that are expanded by all four best settings (WSE-Title, USE-Title, CSE-Title, and WikiChv) are reported

Then, we investigated the 3 queries from Table 9 where mixed results were obtained across the different KBs (i.e. not all KBs consistently provided a gain (loss) for the query)—these were queries 131002, 101001, and 147001. Table 10 shows that the terms added to each of the 3 queries largely differed depending on the KB used. Interestingly, Wikipedia, although being a general purpose KB, produced more relevant health expansion terms than the specialised health KBs (i.e., UMLS and CHV). Nevertheless, we also found that the coverage of the Wikipedia KB was limited compared to that of the UMLS and CHV KBs. In fact, Table 6 (top) shows that the best setting that used the Wikipedia KB (WSE-Title) only expanded 76 queries compared to 217 and 155 queries expanded by the best settings used for the UMLS and CHV KBs. This limitation of Wikipedia may be expected as the Wikipedia KB used in this study (WC-TypeLinks) contained only 13,135 pages—this is orders of magnitude smaller than the UMLS KB (UC-All) and CHV KB (CC-Med), which contained 3,057,234 and 1,344,941 terms, respectively.

Table 10 Terms added to queries 131002: “penis lymphocytic infiltration marked nuclear crush artifact”, 101001: “inguinal hernia repair laparoscopic mesh benefits risks”, and 147001: “throat infection sore throat irritated eyes treatment options”
Table 11 The rate of overlap between expansion terms added from KB i with expansion terms added from KB j. For example, 3.5% of expansion terms from the UMLS are found in expansion terms from Wikipedia.

Finally, we investigated how the expansion terms from each KB differ from each other. Table 11 shows the overlap rate among expansion terms from the best settings of all KBs. As expected, all expansion terms from the Wikipedia and CHV KBs were found within the expansion terms from WikiChv. These results also further confirmed that the coverage of the Wikipedia KB was lower compared to that of the UMLS and CHV KBs. Only 3.5% of UMLS and 7.6% of CHV expansion terms were found in Wikipedia. On the other hand, 19.2% and 20.2% of expansion terms from Wikipedia were found within expansion terms from the UMLS and CHV, respectively. Finally, these results also show that each KB promoted mostly different expansion terms.

6.1 Generalisability of the best settings

We have shown that the best settings of query expansion based on Wikipedia, UMLS, CHV, or the combination of Wikipedia and CHV to form the KB, were able to improve retrieval effectiveness, compared to the original CHS queries. We did so by empirically exploring different KB retrieval settings throughout 5 choices, and selecting the best configuration for each choice. Next, we aimed to validate our findings by verifying whether they apply to a different sample of the web and a different set of CHS queries.

To this aim, we applied the best settings we obtained on the CLEF 2016 collection to the CLEF 2015 collection. This collection contains 66 queries and a corpus of more than 1 million web pages, sampled from health related websites (rather than a general web sample, as in CLEF 2016, i.e. ClueWeb12 B13). Table 12 reports the results obtained when applying the best settings for Wikipedia, UMLS, CHV, and the combination of Wikipedia and CHV to the CLEF 2015 collection. The results showed that:

  • Independently of the KB, RFHT exhibited improvement, but PRFHT did not. These findings were in line with those from CLEF2016.

  • For the all queries set, without relevance feedback, expanded queries from WSE-Title, CSE-Title, and WikiChv provided gains over the baseline for bpref and RBP@10. However, apart from WSE-Title, the expansion methods performed worse than the baseline on nDCG@10.

  • For the high coverage queries set, without relevance feedback, the best settings for CHV (CSE-Title) and for the combination of Wikipedia and CHV (WikiChv) performed better than the baseline for all measures.

Table 12 Performance of the CLEF 2016 best settings on the CLEF 2015 query set. Statistical significance differences reported in Table 23

In summary, the above findings show that the settings found to perform best on CLEF 2016 did translate to the CLEF 2015 collection.

6.2 Mitigating problems with unjudged documents

The analysis of residuals for expanded queries (top part of Tables 3, 4, 5, 6, 7), along with the analysis in Fig. 4, indicated that the baseline had far fewer unjudged documents amongst the top 10 results than the EQFE method. We treated unjudged documents as not relevant; however, given the shallow pools at CLEF 2016, and the fact that the method investigated here did not contribute to the pool (and is substantially different from those that did), there is the possibility that a significant portion of the unjudged documents were, in fact, relevant. To account for this in our analysis of results, along with reporting RBP residuals, we also used bpref (which only considers assessed documents) and further considered the high coverage queries sub-set for each result set (bottom part of Tables 3, 4, 5, 6, 7).

Next, we further analyse our results with respect to unjudged documents, by (1) using the additional relevance assessments made available for this collection in CLEF 2017 (Palotti et al. 2017), and (2) using condensed list evaluation measures (Sakai 2007).


Submission to CLEF 2017


We submitted results from our previous work (Jimmy et al. 2017) to the CLEF 2017 e-Health IR Task 1 (Palotti et al. 2017). In CLEF 2017, the topics from 2016, which we considered in our experiments, were re-used to obtain a deeper and more varied assessment pool. We thus further applied this new set of assessments to study the choices in knowledge based retrieval considered here. Table 13 reports the effectiveness of all expanded queries for Choice 5, using the combined relevance assessments from CLEF 2016 and 2017.

For the all queries set, the top part of Table 13 shows that queries expanded using any of the KBs studied here and without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline, on all measures, with the exception of WSE-Title (worse nDCG@10) and USE-Title (worse nDCG@10 and RBP@10).

While the additional assessments from CLEF 2017 reduced the number of unjudged documents retrieved using expanded queries, we found that residuals from all expanded queries were consistently higher than the residual from the baseline query (see Fig. 6, for which we used the combined CLEF 2016 and 2017 relevance assessments).

We thus turn to analyse the results for the high coverage queries (Table 13, bottom part). For this set, the expanded queries based on any KB and without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline on all measures, with the exception of USE-Title, which had a lower nDCG@10. Overall, the results from the combined CLEF 2016 and 2017 assessments confirmed our findings as summarised at the beginning of Sect. 6.

Table 13 Influence of choices in KB construction for Choice 5 using the combined CLEF 2016 and 2017 relevance assessments (compare with results from Table 7, where only CLEF 2016 assessments were used). Statistical significance analysis is reported in Tables 24 and 25
Fig. 6 Unjudged documents among the top 10 retrieved by runs in Table 13 (top)


Condensed list evaluation


Sakai (2007) suggested computing evaluation measures such as nDCG or average precision on condensed lists, i.e., document rankings obtained by considering only judged documents, as an alternative to bpref for dealing with retrieval results hampered by unjudged documents. We followed this approach to further analyse the results. In Table 14 we report the performance of queries expanded with and without relevance feedback, using condensed list evaluation for precision at 10 (P@10), mean average precision (MAP), nDCG@10 and RBP@10 (for brevity, statistically significant differences are reported in Table 26). Condensed list results suggest that queries expanded with any KB without relevance feedback (i.e., WSE-Title, USE-Title, CSE-Title, or WikiChv) performed better than the baseline on all measures. Any relevance feedback method (RF, PRF, RFHT, or PRFHT) could further improve retrieval effectiveness on all measures, with the only exception of applying RF and PRF to WSE-Title, which obtained a lower MAP than when used without relevance feedback (i.e., \(\hbox {WSE-Title} > \hbox {WSE-TitleRF}\), WSE-TitlePRF).
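A condensed list is obtained simply by removing unjudged documents from a ranking before computing the measures, for example:

```python
def condensed_list(ranking, judgments):
    """Drop unjudged documents (Sakai 2007) before computing P@10, MAP, nDCG@10 or RBP@10."""
    return [doc_id for doc_id in ranking if doc_id in judgments]

# Measures are then computed on condensed_list(run, qrels) instead of the raw run.
```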

Table 14 Performance of expanded queries with and without relevance feedback, using condensed list evaluation. Statistical significance differences reported in Table 26

7 Conclusions

In this paper, we explored the influence of different choices in knowledge base (KB) retrieval for consumer health search (CHS). Choices included KB construction, entity mention extraction, entity mapping, source of expansion, and relevance feedback. We compared the effectiveness of a general KB (Wikipedia), a medical specialised KB (UMLS) and a consumer health vocabulary (CHV) as the basis for query expansion.

Table 15 Summary of Table 7 comparing results from the baseline and those from the best settings of each KB for all queries set

Our empirical evaluation (as summarised in Table 15) showed that the best settings for the Wikipedia KB are:

  1. Index only Wikipedia pages that have health related infobox types or links to medical terminologies.

  2. Use uni-, bi-, and tri-grams of the original queries that matched CHV terms as entity mentions.

  3. Map entity mentions to Wikipedia entities based on the Aliases feature.

  4. Source expansion terms from the mapped Wikipedia page Title.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

As for the UMLS KB, the best settings are:

  1. Index all UMLS concepts.

  2. Use uni-, bi-, and tri-grams of the original queries that matched UMLS terms as entity mentions.

  3. Map entity mentions to UMLS entities based on the Aliases feature.

  4. Source expansion terms from the mapped UMLS Title feature.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

For the CHV KB, the best settings are:

  1. Index all CHV concepts that are related to the four key aspects of medical decision criteria.

  2. Use uni-, bi-, and tri-grams of the original queries that matched CHV terms as entity mentions.

  3. Map entity mentions to CHV entities based on the Aliases feature.

  4. Source expansion terms from the mapped CHV Title feature.

  5. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

Finally, the best combined settings are:

  1. Combine expansion terms from the best settings of Wikipedia and CHV (WikiChv).

  2. Add relevance feedback terms filtered based on the likelihood of being health related (RFHT).

Our empirical evaluation shows that, overall, combining expansion terms from the best settings of Wikipedia and CHV (WikiChv) was more effective than using expansion terms from the best settings of any individual KB. Using expansion terms from the combined KBs (WikiChv) improved upon the baseline in both bpref (+ 8.7%) and RBP@10 (+ 1.1%) when using the full query set and without relevance feedback. For high coverage queries, improvements were observed for nDCG@10 (+ 5.7%), bpref (+ 5.7%), and RBP@10 (+ 12.3%). While the best results were observed using the combined WikiChv KB, the use of each individual KB also resulted in improvements over the baseline on high coverage queries. These findings demonstrate the merit of a knowledge-base retrieval approach in the challenging CHS domain.

The use of relevance feedback with filtering of health related query terms further improved results. For the full query set, expansion with a combined WikiChvRFHT KB improved considerably compared to the baseline: nDCG@10 (+ 51.8%), bpref (+ 29.5%), and RBP@10 (+ 98.2%). For high coverage queries, similar improvements were observed: nDCG@10 (+ 53%), bpref (+ 24.2%), and RBP@10 (+ 82.5%).

The major limitation of our experiments was the number of unjudged documents retrieved using the expanded queries on the CLEF 2016 collection. We addressed this limitation in different ways. When reporting the RBP results, we also reported the residuals: these provide an intuition of how much RBP could be under-estimated because of treating unjudged documents as not relevant. For each set of experiments, we also considered a subset of queries for which a larger portion of assessed documents were retrieved by all approaches. We further augmented the set of assessed documents from CLEF 2016 with the relevance assessments for the same queries made available as part of CLEF 2017. This evaluation further confirmed the findings obtained when considering only the CLEF 2016 assessments. Finally, we also analysed the retrieval results with respect to a condensed list based evaluation (i.e., by considering only judged documents). The condensed list evaluation confirmed our findings that expanded queries, with or without (pseudo) relevance feedback, from all KBs performed better than the baseline. Yet, it remains challenging to fairly evaluate the methods, because of the number of relevance assessments available in the collection. Nevertheless, this work provides an extended investigation into the choices in KB retrieval for CHS, highlighting both what worked and what did not.