Background

Current genomic research is characterized by an immense volume of data, accompanied by a tremendous increase in the number of genomics- and biomedicine-related publications. This wealth of information has led to growing interest in, and need for, applying information retrieval (IR) techniques to access the scientific literature in genomics and related biomedical disciplines.

Given a query, an IR system returns a ranked list of retrieved documents to users. Retrieved documents are ranked in order of their estimated probability of relevance to the query. Traditional retrieval models assume that the relevance of a document is independent of the relevance of other documents. In reality, however, this assumption often does not hold. The usefulness of retrieving a document usually depends on the previously ranked documents, since a user may want the top-ranked documents to cover different aspects of his/her information need rather than read relevant documents that deliver only redundant information. A better IR system should therefore return ranked lists that respect both query-relevance and the breadth of available information.

In the biomedical domain, the information desired for a question (query) asked by biologists is usually a list of entities of a certain type, covering different aspects related to the question [1], such as genes, proteins, diseases and mutations. Hence it is important for a biomedical IR system to provide comprehensive and diverse answers that fulfill biologists’ information needs.

To address this problem, the most recent TREC Genomics tracks investigated “aspect retrieval”, whose purpose is to study how a biomedical retrieval system can support a user in gathering information about the different aspects of a topic. We consider the aspects of a document to be the concepts, entities or topics it contains. In the Genomics tracks, biomedical IR systems were required to return relevant information at the passage level, and relevance judges not only rated the passages but also grouped them by aspect. The aspects of a retrieved passage could be a list of named entities or MeSH terms, representing answers that cover different portions of a full answer to the query. Aspect Mean Average Precision (Aspect MAP) was defined in the Genomics tracks to capture similarities and differences among retrieved passages; it indicates how comprehensively the questions are answered. Relevant passages that contribute no aspects beyond those already retrieved by higher-ranked passages do not accumulate Aspect MAP [2]. Aspect MAP is therefore a measure of redundancy and diversity in the IR ranked list.
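To make the novelty-rewarding behavior of the measure concrete, the following is a minimal sketch of how such an aspect-level average precision could be accumulated over a ranked list. It is a toy approximation, not the official track evaluation script; the input representation (a set of judged aspects per passage) is assumed.

```python
def aspect_ap(ranked_passages, gold_aspects):
    """Toy aspect-level average precision for a single topic.

    ranked_passages: list of sets -- the aspects judged relevant for each
                     retrieved passage (empty set for a non-relevant one).
    gold_aspects:    set of all distinct relevant aspects for the topic.
    """
    seen = set()          # aspects already covered by higher-ranked passages
    precision_sum = 0.0
    for rank, aspects in enumerate(ranked_passages, start=1):
        new = aspects - seen
        if new:           # only passages contributing new aspects earn credit
            seen |= new
            precision_sum += len(seen) / rank
    return precision_sum / len(gold_aspects) if gold_aspects else 0.0

# A passage repeating already-seen aspects (rank 2) earns nothing:
print(aspect_ap([{"a1", "a2"}, {"a2"}, {"a3"}], {"a1", "a2", "a3"}))
```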

Our work is inspired by several recent papers concerned with promoting diversity and novelty in the IR ranked list. Carbonell et al. introduced the maximal marginal relevance (MMR) method, which attempts to maximize relevance while minimizing similarity to higher-ranked documents [3]. In order to measure the redundancy between documents, Zhang et al. presented four redundancy measures: “set difference”, “geometric distance”, “distributional similarity” and “a mixture model” [4]. They modeled relevance and redundancy separately. Since they focused on filtering redundant documents, the experiments in their study were conducted on a set of relevant documents. In reality, however, IR systems always return non-relevant documents along with relevant ones, so redundancy and relevance should both be considered. Zhai et al. validated a subtopic retrieval method based on a risk minimization framework [5]; it combined the mixture-model novelty measure with query-likelihood relevance ranking. More recently, a new diversity task for Web retrieval was defined in the TREC 2009 Web track [6]. Two evaluation measures that reward novelty and diversity, α-nDCG [7] and an intent-aware version of precision (IA-P) [8], were validated in that task. Top diversity-task results showed that re-ranking methods based on anchor text, sites of search results, link filtering, clustering and sub-query suggestion were effective for diversifying Web retrieval results [9–12]. However, these studies focused mainly on Web search and did not take the characteristics of genomics search into account. How to promote ranking diversity in biomedical information retrieval still needs further investigation.
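As a reference point for the re-ranking methods discussed above, here is a minimal sketch of MMR’s greedy selection, assuming precomputed query-relevance scores and a pairwise similarity function (both hypothetical inputs):

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, k=10):
    """Greedy maximal marginal relevance (MMR) re-ranking [3].

    relevance:  dict mapping document id -> query-relevance score
    similarity: function (doc_a, doc_b) -> similarity in [0, 1]
    lam:        trade-off between relevance and novelty
    """
    selected, pool = [], set(candidates)
    while pool and len(selected) < k:
        def marginal(d):
            # Penalize similarity to the most similar already-selected document.
            max_sim = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * max_sim
        best = max(pool, key=marginal)
        selected.append(best)
        pool.remove(best)
    return selected
```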

In biomedical information retrieval, Genomics aspect retrieval was first proposed in the TREC 2006 Genomics track and further investigated in the 2007 track. Many research groups joined these annual campaigns to evaluate their systems and methodologies. However, to the best of our knowledge, little previous work has been conducted on Genomics aspect retrieval for promoting diversity in the ranked list. The University of Wisconsin re-ranked the retrieved passages using a clustering-based approach named GRASSHOPPER to promote ranking diversity [13]. GRASSHOPPER is an alternative to MMR and its variants, with a principled mathematical model and strong empirical performance on artificial data sets [14]. Unfortunately, for Genomics aspect retrieval, this re-ranking method hurt their system’s performance and decreased the Aspect MAP of the original results [13]. Later, in the TREC 2007 Genomics track, most teams tried to obtain aspect-level performance through their passage-level results instead of working on aspect-level retrieval directly [1, 15, 16]. Another study concerned with Genomics aspect retrieval was conducted in [17]; its experimental results demonstrated that a hidden-property-based re-ranking method can achieve promising performance improvements.

In our preliminary study, we showed that Wikipedia can be used as an external knowledge resource to facilitate biomedical IR [18]. However, how to combine the novelty and the relevance of a document to maximize the effectiveness of IR systems remains a challenging research question.

Methods

Datasets and evaluation measures

In order to evaluate the proposed approach for promoting ranking diversity in biomedical information retrieval, we use the TREC 2006 and 2007 Genomics track collection as the test corpus. It is a full-text biomedical corpus consisting of 162,259 documents from 49 genomics-related journals indexed by MEDLINE [1, 2]. Twenty-eight official topics from the 2006 Genomics track and 36 official topics from the 2007 Genomics track are used as queries. Topics take the form of questions asking for lists of specific entities that cover different portions of full answers to the topics [1, 2].

Three levels of retrieval performance were measured in the TREC 2006 and 2007 Genomics tracks: passage retrieval, aspect retrieval and document retrieval. Each was measured by a variant of mean average precision (MAP). Passage MAP, Passage2 MAP (defined in the TREC 2007 Genomics track as an alternative to the Passage MAP of the 2006 track), Aspect MAP and Document MAP are the four evaluation measures corresponding to these three levels of retrieval performance; their definitions can be found in [2] and [1]. In this paper, we focus mainly on aspect-level and passage-level retrieval performance, since our objective is to promote diversity in the ranked list of retrieved passages. Moreover, aspect retrieval and passage retrieval were also the major tasks of these two Genomics tracks.

The Genomics collections represent only a fraction of the millions of biomedical articles indexed by MEDLINE. However, to the best of our knowledge, they are so far the largest, and the only, biomedical text collections with both manual relevance assessments and diversity evaluation available for biomedical text retrieval research.

Baseline runs

For 2007’s topics, three IR baseline runs are used. NLMinter [15] and MuMshFd [19] were two of the most competitive IR runs submitted to the TREC 2007 Genomics track. NLMinter, developed by the U.S. National Library of Medicine [15], achieved the best performance in the TREC 2007 Genomics track in terms of Aspect MAP, Passage2 MAP and Document MAP [1, 15]. It merged the retrieval results obtained by Essie [20], Indri, Terrier [21], Theme and EasyIR [22], and employed a human-involved relevance feedback method. The third IR baseline run is an Okapi run, based solely on the probabilistic weighting model BM25 [23]. The performance of the Okapi run is also above average among all results reported in the TREC 2007 Genomics track [1].

For 2006’s topics, we test our approach on three Okapi runs, since other retrieval results submitted to the TREC 2006 Genomics track are not available. In order to find out whether the proposed method works well on strong baselines as well as on average and weak ones, we set different values for the BM25 parameters to obtain different baselines [24]. The performance of the baseline run Okapi06b is also among the top performances reported in the TREC 2006 Genomics track [2].
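For reference, the following is a minimal sketch of BM25 document scoring, the weighting model behind the Okapi baselines; k1 and b are the parameters varied to obtain stronger or weaker baselines (the default values below are conventional, not the runs’ actual settings):

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """Okapi BM25 score of a single document.

    doc_tf: dict term -> frequency in the document
    df:     dict term -> document frequency in the collection
    N:      number of documents in the collection
    """
    score = 0.0
    for t in query_terms:
        if t not in doc_tf:
            continue
        idf = math.log(1.0 + (N - df[t] + 0.5) / (df[t] + 0.5))
        tf = doc_tf[t]
        # Term-frequency saturation (k1) and document-length normalization (b)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score
```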

The performance of the baseline runs is shown in Table 1. The best and mean results reported in the 2006 and 2007 Genomics tracks are shown in Table 2.

Table 1 Performance of IR baseline runs
Table 2 The best and mean results in the Genomics tracks

Aspect detection

As described above, we focus on aspect retrieval for promoting ranking diversity. It is therefore necessary to detect the aspects contained in a document. Because of the frequent use of acronyms and the presence of homonyms and synonyms in the biomedical literature, using the terms that appear in documents as aspects for re-ranking would not be effective. For example, when the term “AIDS” appears in a document, it may refer to the disease caused by the human immunodeficiency virus (HIV) or to the medical help given to patients. Clearly, a bag-of-words method cannot capture the semantic meaning of terms. In the following, we use Wikipedia for aspect detection.

Wikipedia is a free online encyclopedia edited collaboratively by large numbers of volunteers. The exponential growth and the reliability of Wikipedia make it a potentially valuable knowledge resource, and how to utilize Wikipedia to facilitate information retrieval has become an active research topic over the last few years [25–27].

However, as far as we are aware, no work has investigated how to use Wikipedia to improve biomedical IR performance. The main reason is that several domain-specific thesauri are available for biomedical retrieval (e.g. UMLS, MeSH and the Gene Ontology). Nonetheless, these thesauri only provide the synonyms, hypernyms and hyponyms of a specific term, without any other context. It is therefore hard to tell which lexical variants of a term should be used to retrieve a user’s information need. Previous studies showed that lexical variants from domain-specific thesauri usually had to be assigned manually to achieve performance improvements [28, 29], and the retrieval results obtained using domain-specific thesauri are somewhat conflicting [15, 19, 30].

Wikipedia, on the other hand, not only provides concepts (entities) and lexical variants of a specific term, but also provides abundant context. With the help of enriched entity pages, it is possible to identify which concepts and lexical variants are related under a specific context. As Wikipedia articles are constantly being updated and new entries are created every day [27], we can expect Wikipedia to cover the great majority of medical terms. Another reason for using Wikipedia is that it contains plenty of linkage information among semantically related entities. Each link in Wikipedia is associated with an anchor text, which can be regarded as a descriptor of its target article. Anchor texts provide alternative names, morphological variations and related phrases for the target articles. Anchors also encode polysemy, because the same anchor may link to different articles depending on the context in which it is found [31]. Aspect detection with Wikipedia involves three steps:

1. identifying candidate phrases in the given retrieved document;

2. mapping them to Wikipedia articles;

3. selecting the most salient concepts.

The outcome is a set of concepts representing the aspects mentioned in the input documents [26, 31]. The Wikify service provided by Wikipedia Miner (http://wikipedia-miner.sourceforge.net) is used to automatically detect the aspects covered by retrieved documents.
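A minimal sketch of how such a wikification step might be wrapped from Python follows; the endpoint URL, parameter names and response shape are placeholders and assumptions, not the documented Wikipedia Miner API:

```python
import requests

WIKIFY_URL = "http://example.org/wikipediaminer/services/wikify"  # placeholder

def detect_aspects(passage_text, min_salience=0.5):
    """Send a retrieved passage to a Wikify-style service and return the
    titles of the detected Wikipedia concepts as the passage's aspects."""
    response = requests.get(WIKIFY_URL, params={
        "source": passage_text,          # assumed parameter name
        "minProbability": min_salience,  # assumed salience threshold
        "responseFormat": "json",
    })
    response.raise_for_status()
    topics = response.json().get("detectedTopics", [])  # assumed response shape
    return {t["title"] for t in topics}
```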

An example is shown in Table 3. Terms that can be linked to their corresponding Wikipedia concepts are displayed in bold font. Although “SLE” matches more than twenty distinct entities in Wikipedia, in this example it is successfully linked to the Wikipedia concept “Systemic lupus erythematosus”. This is because the context provided by the enriched Wikipedia entity pages helps the disambiguation: the terms “lupus”, “nephritis”, “antibodies”, “serum” and “renal disease” appear both in the “Systemic lupus erythematosus” Wikipedia page and in the retrieved passage, whereas for other “SLE”-related entities in Wikipedia, e.g. “Sober living environment” and “Supported leading edge”, we can barely find any terms in common between the entity page and the retrieved passage (a simple sketch of this overlap-based disambiguation follows Table 3).

Table 3 An example of aspect detection using Wikipedia
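The disambiguation behavior illustrated by this example can be sketched as a term-overlap vote between the passage and each candidate entity page. Real entity linkers such as Wikipedia Miner also use link priors and concept relatedness, so this is an illustration only, with made-up page term sets:

```python
def disambiguate(passage_terms, candidate_pages):
    """Pick the candidate Wikipedia entity whose page shares the most
    terms with the retrieved passage.

    candidate_pages: dict mapping entity title -> set of terms on that page
    """
    return max(candidate_pages,
               key=lambda title: len(candidate_pages[title] & passage_terms))

passage_terms = {"lupus", "nephritis", "antibodies", "serum", "renal", "disease"}
candidates = {
    "Systemic lupus erythematosus":
        {"lupus", "nephritis", "antibodies", "serum", "renal", "autoimmune"},
    "Sober living environment": {"housing", "recovery", "addiction"},
    "Supported leading edge": {"wing", "aircraft", "aerodynamics"},
}
print(disambiguate(passage_terms, candidates))  # -> Systemic lupus erythematosus
```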

The relevance-novelty combined model

The proposed RelNov model is based on an undirected probabilistic graphical model (a Markov random field). A graphical model is a graph that models the joint probability distribution over a set of random variables: nodes in the graph represent random variables, and missing edges between nodes represent conditional independencies. The joint density can be factorized over the cliques of the graph.

In order to promote ranking diversity in the ranked list, we consider that the document ranking should depend on which documents the user has already seen. As shown in Figure 1, the proposed RelNov model represents the joint probability of θ_d, θ_0, R and N, which denote the document model of the retrieved document, the document model of the previously ranked documents, the relevance of the document and the novelty of the document, respectively. Edges in the graph define conditional independence assumptions between the variables. The joint distribution across potential functions in the graph represents the probability of a document being relevant to a biologist’s information need as well as being novel given the previously ranked documents.

Figure 1

The RelNov model. The RelNov model represents the joint probability of θ_d, θ_0, R and N; edges define conditional independence assumptions between the variables.

We then decompose the document model θ_d into θ_t and θ_a to capture both the lexical and the conceptual information in a retrieved document, where θ_t denotes the term-based document model and θ_a the aspect-based document model (similarly, θ_{d_0} is decomposed into θ_{t_0} and θ_{a_0}). Since we consider the aspects of a document to be the concepts, entities or topics it contains, θ_a models the conceptual information of a document. The RelNov model can therefore be represented as the two component models shown in Figure 2, namely the aspect-term relevance model and the aspect-term novelty model.

Figure 2

The aspect-term relevance model and the aspect-term novelty model. They are the two component models of the RelNov model.

Based on conditional independence assumptions, the joint probability distribution is written as a product of potential functions over the maximal cliques in the graph.

$$P(\theta_d, \theta_0, R, N) = \frac{1}{Z} \prod_{c \in C} \phi(c) \qquad (1)$$

where ϕ(c) is a positive potential function over a clique in the graph, C is the set of cliques in the graph, and Z is a normalization constant.

Potential functions in the RelNov model are defined for the cliques:

$$\phi(\theta_t, R), \quad \phi(\theta_a, R), \quad \phi(\theta_t, \theta_{t_0}, N), \quad \phi(\theta_a, \theta_{a_0}, N) \qquad (2)$$

As noted above, all potential functions must be non-negative, and are most commonly parameterized as:

$$\phi(c) = \exp\big(\omega_c\, f(c)\big) \qquad (3)$$

where f(c) is a real-valued feature function over clique values and ω_c is the weight given to that particular feature function. Therefore, Equation (1) can be written as:

$$P(\theta_d, \theta_0, R, N) = \frac{1}{Z} \exp\Big(\sum_{c \in C} \omega_c\, f(c)\Big) \qquad (4)$$
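Since Z is constant across the documents retrieved for a query, ranking by Equation (4) is equivalent to ranking by the weighted sum of clique feature functions. A minimal sketch of that rank-equivalent score follows, using the four cliques of Figure 2; the feature values in the toy usage, and their mapping onto the cliques, are illustrative only:

```python
def relnov_score(feature_values, weights):
    """Rank-equivalent RelNov score: Z in Equation (4) is shared by all
    documents retrieved for a query, so ranking by exp(sum_c w_c * f(c))
    is the same as ranking by the weighted feature sum itself."""
    return sum(weights[c] * f for c, f in feature_values.items())

# Toy usage with the four cliques of Figure 2. The weight values are those
# trained for 2007's topics; the feature values are made up.
weights = {"term_rel": 0.35, "aspect_rel": 0.40, "term_nov": 0.20, "aspect_nov": 0.05}
doc_features = {"term_rel": -2.1, "aspect_rel": -1.7, "term_nov": 0.6, "aspect_nov": 0.4}
print(relnov_score(doc_features, weights))
```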

In the following, we present the RelNov model’s component models, the aspect-term relevance model and the aspect-term novelty model.

Aspect-Term relevance model

The aspect-term relevance model corresponds to the cliques ϕ(θ_t, R) and ϕ(θ_a, R). The feature function for each clique can be written as:

$$f(\theta_u, R) = \sum_{u_i \in \theta_u} \log P(u_i \mid R) \qquad (5)$$

where θ_u denotes the aspect-based or the term-based document model and u_i denotes an aspect or a term in the document. Since we do not usually have relevance information, P(u_i|R) is unavailable. One possible solution, as introduced in [32], is to consider that P(R|u_i) ≈ P(Q|u_i). Equation (5) can thus be re-written as:

$$f(\theta_u, R) = \sum_{u_i \in \theta_u} \log \sum_{j=1}^{N} P(u_i \mid d_j, Q)\, P(d_j \mid Q) \qquad (6)$$

where P(d_j|Q) is the relevance model, indicating whether a retrieved document d_j (j = 1, 2, …, N, where N is the number of retrieved documents) is relevant to the query Q, and P(u_i|d_j, Q) is the co-occurrence model, indicating whether an aspect or a term u_i is associated with the query.

The relevance model can be estimated using the baseline ranking scores of the retrieved documents. To estimate the co-occurrence model, a linear interpolation of the within-document frequency of u_i and the query background information is used:

$$P(u_i \mid d_j, Q) = \frac{freq(u_i, d_j) + \mu\, \dfrac{df(u_i)}{\sum_{u_k \in U} df(u_k)}}{\sum_{u_k \in U} freq(u_k, d_j) + \mu} \qquad (7)$$

where U denotes the set of aspects or terms for the query, D denotes the set of retrieved documents for the query, freq(x, y) denotes the frequency of x in y, and df(u_i) denotes the document frequency of an aspect or a term u_i. We use a Dirichlet prior for the smoothing parameter μ.

(8)

where κ is the average aspect frequency of all aspects or terms in the retrieved documents.
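A minimal sketch of this kind of Dirichlet-smoothed estimate follows; the background distribution and the prior below are stand-ins following the general shape of Equations (7) and (8), not the paper’s exact formulation:

```python
from collections import Counter

def dirichlet_prob(u, doc_counts, doc_len, bg_counts, bg_len, mu):
    """Dirichlet-smoothed P(u | d): interpolate the within-document
    frequency of u with a background distribution estimated over all
    retrieved documents. This mirrors the general shape of Equation (7);
    the paper's exact background term (e.g. the role of df(u_i)) may differ."""
    p_bg = bg_counts[u] / bg_len
    return (doc_counts[u] + mu * p_bg) / (doc_len + mu)

# Toy usage: two retrieved "documents" represented as lists of aspects.
docs = [["SLE", "nephritis", "antibody"], ["SLE", "serum", "antibody", "antibody"]]
bg = Counter(a for d in docs for a in d)
bg_len = sum(bg.values())
mu = bg_len / len(bg)  # crude stand-in for the kappa-based prior of Equation (8)
d0 = Counter(docs[0])
print(dirichlet_prob("antibody", d0, len(docs[0]), bg, bg_len, mu))
```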

Aspect-Term novelty model

The aspect-term novelty model shown in Figure 2(b) aims to provide users with novel rather than redundant information by promoting diversity in the IR ranked list.

We consider that the novelty of the i-th document depends on the i – 1 documents the user has already seen. Let θ_{u_0} = {θ_{u_1}, θ_{u_2}, …, θ_{u_{i-1}}} be the aspect-based or term-based document models of the previously ranked documents and θ_u be the aspect-based or term-based document model of the i-th document d_i. We then need to measure how much novel information is contained in d_i. The feature function of the aspect-term novelty model can be written as:

$$f(\theta_u, \theta_{u_0}, N) = \mathop{comb}_{1 \le k \le i-1} N(\theta_u; \theta_{u_k}) \qquad (9)$$

where comb denotes a combination of the individual novelty scores N(θ_u; θ_{u_k}).

Three obvious possibilities for combining the individual novelty scores are taking the minimum, maximum, and average. Taking the average has been shown to be more effective than taking the minimum and maximum [5]. Therefore, Equation (9) can be re-written as:

$$f(\theta_u, \theta_{u_0}, N) = \frac{1}{i-1} \sum_{k=1}^{i-1} N(\theta_u; \theta_{u_k}) \qquad (10)$$

Various novelty measures can be used to calculate the value N(θ_u; θ_{u_k}). In this paper, we choose the mixture model [4] as the novelty measure. The mixture model is a plausible novelty measure that outperforms several commonly used ones, e.g. set difference, KL divergence and geometric distance. It assumes that the new document is generated by a two-component mixture model, in which one component is the previously ranked document model and the other is a background language model [4, 5].
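A minimal sketch of that mixture-model novelty measure: the mixing weight λ of the previously-ranked-document component is fitted by EM, and a document that is well explained by previous documents (high λ) receives a low novelty score. The smoothing constants and the mapping from λ to a novelty score follow [4, 5] only loosely:

```python
def mixture_novelty(new_doc, old_model, bg_model, iters=20):
    """Mixture-model novelty measure (after [4, 5]): assume the new document
    is generated by lam * P(w | previous docs) + (1 - lam) * P(w | background)
    and fit the mixing weight lam by EM.

    new_doc: list of tokens (terms or aspect ids)
    old_model, bg_model: dicts mapping token -> probability (assumed precomputed)
    """
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior that each token was drawn from the old-document model
        post = {}
        for w in set(new_doc):
            p_old = lam * old_model.get(w, 1e-10)        # 1e-10: crude smoothing
            p_bg = (1.0 - lam) * bg_model.get(w, 1e-10)
            post[w] = p_old / (p_old + p_bg)
        # M-step: re-estimate the mixing weight
        lam = sum(post[w] for w in new_doc) / len(new_doc)
    return 1.0 - lam  # well explained by previous docs => low novelty
```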

Results and Discussion

Re-Ranking performances

In order to set the parameter values of ω_c, we train ω_c for 2007’s topics on 2006’s topics, and ω_c for 2006’s topics on 2007’s topics, by directly maximizing Aspect Mean Average Precision [33]. Note that, for each year’s topics, ∑_{c∈C} ω_c = 1, and a simple coordinate-level hill-climbing search is used to optimize Aspect MAP [34] (a sketch of this tuning loop follows Table 5). Evaluation results of using the proposed RelNov model for document re-ranking on 2007’s topics are shown in Table 4. For 2007’s topics, the ω_c for the potential functions in Equation (2) are set to 0.35, 0.4, 0.2 and 0.05 respectively, based on the training process on 2006’s topics. The values in parentheses are the relative rates of improvement over the original results. For 2006’s topics, re-ranking results and improvements are shown in Table 5. Based on the training process on 2007’s topics, the ω_c for the potential functions are set to 0.3, 0.35, 0.3 and 0.05 respectively. As these two tables show, our approach achieves promising and consistent performance improvements over all baseline runs, on all levels of evaluation measures. It is worth mentioning that our approach further improves the best result (NLMinter) reported in the TREC 2007 Genomics track, achieving a 16.4% improvement in Aspect MAP and a 9.8% improvement in Passage2 MAP. The experimental results demonstrate that our approach not only promotes the diversity of ranked lists, but also improves the relevance of retrieval results.

Table 4 Re-ranking performance on 2007’s topics with aspect detection using Wikipedia
Table 5 Re-ranking performance on 2006’s topics with aspect detection using Wikipedia
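A minimal sketch of the coordinate-level hill-climbing search used for this tuning, assuming an aspect_map function that evaluates a candidate weight vector on the training year’s topics; the step size and number of rounds are illustrative:

```python
def hill_climb_weights(cliques, aspect_map, step=0.05, rounds=10):
    """Coordinate-level hill climbing over the clique weights (which sum
    to 1), directly maximizing Aspect MAP on the training topics."""
    w = {c: 1.0 / len(cliques) for c in cliques}
    best = aspect_map(w)
    for _ in range(rounds):
        for c in cliques:                  # one coordinate at a time
            for delta in (step, -step):
                cand = dict(w)
                cand[c] = max(0.0, cand[c] + delta)
                total = sum(cand.values())
                cand = {k: v / total for k, v in cand.items()}  # keep sum = 1
                score = aspect_map(cand)
                if score > best:
                    best, w = score, cand
    return w, best
```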

We also note that, in terms of Aspect MAP, the improvements on 2007’s topics are more significant than those on 2006’s topics. This might be because the average number of distinct aspects per 2007 topic (72.3 aspects per topic) is much larger than that per 2006 topic (27.9 aspects per topic) [1, 2]. A topic with more distinct aspects indicates that its information need may be more diverse, and in this case our approach performs better.

Effect of the use of aspects

In order to capture the conceptual information of retrieved documents, we use Wikipedia concepts to represent the aspects covered by retrieved documents. The advantage of using Wikipedia for aspect detection is that Wikipedia not only provides concepts (entities) and lexical variants of a specific term, but also provides abundant context. With the help of enriched entity pages, it is possible to identify which concepts and lexical variants are related under a specific context [18]. From Figures 3 and 4, we can see that performance improvements are achieved even when only aspects are used for re-ranking. Nonetheless, the best performances are achieved when both terms and aspects are used. Moreover, re-ranking without aspects (using only the terms in retrieved documents) may hurt retrieval performance. This might be due to the frequent use of (possibly non-standardized) acronyms and the presence of homonyms and synonyms in the biomedical literature. Therefore, using Wikipedia (with the enriched context provided by entity pages) to detect aspects and representing them with Wikipedia concepts plays an important role in the re-ranking.

Figure 3

Effects of the use of aspects on 2007’s topics. The x-axis presents the evaluation measures; “NLM”, “MuM” and “Oka” stand for the three baselines NLMinter, MuMshFd and Okapi07.

Figure 4

Effects of the use of aspects on 2006’s topics. The x-axis presents the evaluation measures; “06a”, “06b” and “06c” stand for the three baselines Okapi06a, Okapi06b and Okapi06c.

Comparison with aspect detection using UMLS

Our experimental results have demonstrated that aspect detection using Wikipedia is effective for result diversification. However, in biomedical IR, the use of domain-specific thesauri is still the most common way of integrating external knowledge. It is therefore worthwhile to compare the re-ranking performance of aspect detection using Wikipedia with that of aspect detection using domain-specific thesauri. We use the UMLS (http://www.nlm.nih.gov/pubs/factsheets/umls.html), the largest thesaurus in the biomedical domain, as the knowledge resource for aspect detection. In practice, we use MetaMap [35], a program developed at the National Library of Medicine (NLM), to map biomedical text to the thesaurus or, equivalently, to discover thesaurus concepts referred to in text. The UMLS Metathesaurus is used as MetaMap’s biomedical knowledge resource; it includes the NCBI taxonomy, the Gene Ontology, the Medical Subject Headings (MeSH), OMIM and the Digital Anatomist Symbolic Knowledge Base. Research has shown that MetaMap is an effective tool for discovering thesaurus concepts in text [35].

Re-ranking results based on aspect detection using the UMLS are presented in Tables 6 and 7. As we can see, when the UMLS is used for aspect detection, performance improvements are obtained in terms of Aspect MAP and Passage2 MAP; however, Passage MAP and Document MAP may decrease on some baselines. We also find that, compared with aspect detection using the UMLS, aspect detection based on Wikipedia achieves more significant and more stable performance improvements. This is because the enriched entity pages in Wikipedia allow a better mapping between terms in biomedical text and concepts. Moreover, instead of only providing lexical and hierarchical relationships (synonyms, hypernyms, hyponyms) among biomedical concepts as the UMLS does, Wikipedia’s plentiful links and anchor texts also provide more natural relationships among Wikipedia concepts.

Table 6 Re-ranking performance on 2007’s topics with aspect detection using UMLS
Table 7 Re-ranking performance on 2006’s topics with aspect detection using UMLS

Comparison with the subtopic retrieval method

The subtopic retrieval method proposed by Zhai et al. in [5] combined relevance scores from a retrieval baseline with novelty scores from the mixture model using a cost-based method. Their work built on the maximal marginal relevance (MMR) ranking function [3] and argued for the value of diversity. The subtopic retrieval method was shown to be effective in promoting diversity in the ranked list [5]. In order to further evaluate the proposed approach, we compare it with the subtopic retrieval method.

The comparison results shown in Figures 5 and 6 illustrate that our approach outperforms the subtopic retrieval method on all three levels of retrieval, and its advantage is most significant in terms of Aspect MAP. This indicates that our approach is more effective in promoting ranking diversity for biomedical IR. In our approach, the aspects covered by retrieved documents are represented by corresponding Wikipedia concepts, whereas the MMR method employs text similarity as the novelty measure, using the terms in the retrieved documents to compute document novelty. Because of the frequent use of (possibly non-standardized) acronyms and the presence of homonyms and synonyms in the biomedical literature, using Wikipedia (with the enriched context provided by entity pages) to detect aspects and representing them with Wikipedia concepts can result in better biomedical IR performance.

Figure 5

Comparison with the subtopic retrieval method on 2007’s topics. The x-axis presents the evaluation measures; “NLM”, “MuM” and “Oka” stand for the three baselines NLMinter, MuMshFd and Okapi07.

Figure 6

Comparison with the subtopic retrieval method on 2006’s topics. The x-axis presents the evaluation measures; “06a”, “06b” and “06c” stand for the three baselines Okapi06a, Okapi06b and Okapi06c.

Conclusions

In this paper, we present a relevance-novelty combined model based on an undirected graphical model for promoting ranking diversity in genomics search. The proposed combined model, namely RelNov, models aspects, terms, topic relevance and document novelty as potential functions. Specifically, we propose a two-stage model for modeling aspect-term topic relevance and use the mixture model to measure aspect-term novelty. Experimental results demonstrate that the proposed approach is effective both in promoting ranking diversity and in improving the relevance of ranked lists for genomics search. The use of aspects also plays an important role in the model. Moreover, the proposed model is flexible: it can easily integrate various relevance and novelty measures.