Mining subtopics from different aspects for diversifying search results

Wang, Chieh-Jen; Lin, Yung-Wei; Tsai, Ming-Feng; Chen, Hsin-Hsi

doi:10.1007/s10791-012-9215-y

Mining subtopics from different aspects for diversifying search results

Search Intents and Diversification
Published: 18 December 2012

Volume 16, pages 452–483, (2013)
Cite this article

Download PDF

Information Retrieval Aims and scope Submit manuscript

Mining subtopics from different aspects for diversifying search results

Download PDF

Chieh-Jen Wang¹,
Yung-Wei Lin¹,
Ming-Feng Tsai² &
…
Hsin-Hsi Chen¹

789 Accesses
10 Citations
Explore all metrics

Abstract

User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper is aimed at mining the subtopics of a query either indirectly from the returned results of retrieval systems or directly from the query itself to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of clusters is investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on the TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on the TREC10. The results conclude that the subtopic mining technique with the up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and the proposed subtopic-based diversification algorithm can select the documents covering various subtopics.

Identifying interdisciplinary topics and their evolution based on BERTopic

Article 03 July 2023

Leveraging Semantic Search and LLMs for Domain-Adaptive Information Retrieval

Meta-path automatically extracted from heterogeneous information network for recommendation

Article 13 April 2024

1 Introduction

Users’ queries represented by a few keywords usually contain a certain extent of ambiguity (Spärck-Jones et al. 2007). An ambiguous query may refer to multiple aspects or have more than one interpretation. In this paper, different interpretations and aspects associated with a query are called subtopics. For example, the query “dinosaur” may refer to three subtopics: (1) a paleontological science, (2) a multimedia recreation, and (3) a scenic spot. Each subtopic may also contain several sub–sub-topics, for example, “comics”, “television shows”, “films”, and “music bands” with respect to the subtopic of “a multimedia recreation”. In addition, even a query with a clearly faceted interpretation might still be under-specified, because it is not clear which subtopic of the interpretation is actually desirable for users. For instance, the faceted query “air travel information” may contain different subtopics, such as (1) information on air travel, airports, and airlines, (2) restrictions for checked baggage during air travel, and (3) websites that collect statistics and report about airports.

Traditional information retrieval models, such as the Boolean model and the vector space model typically only consider the relevance between a query and documents. These retrieval models treat every input query as a clear, well-defined representation and completely neglect any sort of ambiguity. This negligence results in the top ranked documents possibly containing too much relevant information on the same subtopic. This might increase users’ search time for distinguishing whether the retrieved documents contain redundant information. In addition, some retrieval models anticipate that the underlying meaning of a submitted query is always the most popular subtopic. These models may focus the retrieval process on popular/particular subtopics too much. This postulation may have risks with a wrong guess (e.g., the user information need is different from the most popular subtopic), which could leave users unsatisfied. For maximizing the satisfaction of different users, a retrieval model has to select a list of documents that are not only relevant to some popular subtopics, but also covers different subtopics. Users can quickly find relevant information that may be interesting if search results are diversified. Nevertheless, how to balance the relevance and the diversity of search results is a trade-off. On the one hand, too many subtopics may provide diversified information but introduce many irrelevant documents, which cause a relevance issue. On the other hand, if only the similarity between a query and documents is considered, too many documents belonging to the same subtopics are retrieved, which causes a diversity issue.

In recent years, using subtopics of a query for diversifying the retrieved documents has received considerable attention (Song et al. 2011). The broad topic associated with an ambiguous or unclear query can be decomposed into a set of subtopics. This provides an opportunity to deal with the problem of search result diversification, as we can employ the clues from the subtopics to rank a diverse ranking list based on optimizing the maximum coverage of the subtopics. In this paper, we introduce a novel framework for search result diversification that exploits the subtopics embedded in queries and ranks the retrieved documents based on these discovered subtopics. Several methods are proposed for mining subtopics from different aspects, such as the retrieved documents, the search query logs, and the related search services provided by the commercial search engines.

Theoretically, the diversified retrieval models should provide a ranking list of documents that has the maximum coverage and minimum redundancy with respect to the possible subtopics underlying a query. Moreover, the covered subtopics should also reflect their relative importance for the query, as perceived from most users (Yin et al. 2009; Agrawal et al. 2009). For example, a query “java” may have three subtopics (a programming language, coffee, and an island), where the subtopic of island attracts less interest than the subtopic of programming language. The subtopic of programming language should be relatively more important than the subtopic of island for the query “java”.

In this paper, we propose a subtopic-based diversified retrieval framework that first uncovers different subtopics embedded in a query, then assigns a weight for each mined subtopic to describe its importance, and finally estimates the relevance of the retrieved documents to each mined subtopic for diversifying search results. The proposed framework not only keeps the quality of relevance, but also re-ranks the top-ranked retrieved results to cover multiple important subtopics. Specifically, there are three components in the proposed diversification framework, the richness of subtopics, the importance of subtopics, and the novelty of subtopics. The richness part aims at measuring how many subtopics are covered by a document, the importance part estimates the importance of the subtopics of a query, and the novelty part computes how many subtopics have already been covered by the previously retrieved documents. With these three aspects, the proposed document ranking algorithm uses a greedy-like strategy to select a list of documents that can cover as many multiple and relatively important subtopics as possible.

We conduct a series of experiments to evaluate the effectiveness of the proposed subtopic mining techniques and the diversification ranking algorithms. The experimental datasets are the ClueWeb09 Category A and Category B test collections with the topics of the TREC09 and TREC10 Web Track diversity tasks. The subtopic-based diversified retrieval framework has a large impact on the effectiveness for search result diversification, as shown in the experimental results. Compared with the state-of-the-art models in the TREC09 and TREC10, the proposed diversified retrieval framework significantly improves the diversity of search results, especially when integrating multiple aspects of resources.

The remainder of this paper is organized as follows. The related works are presented and compared in Sect. 2. Section 3 describes the subtopic mining methods and document diversification algorithms. In Sect. 4, the datasets used for experiments are described. The experimental results are reported and discussed in Sect. 5. We finally conclude our work and provide several directions for future work in Sect. 6.

2 Related works

In this section, we review some query understanding approaches to mining search intents and improving the performance of information retrieval systems. Next, we survey several previous studies about diversifying search results. Finally, we address the contributions of our subtopic-based diversified retrieval framework for search result diversification.

2.1 Query understanding

Many information retrieval models have benefited from taking into account the users’ search intent. These models generally have relied on predefined categories to predict underlying search intents of queries for improving the search performance (Rose and Levinson 2004; Chang et al. 2006). Understanding users’ search intents of queries can be achieved using different types of external resources, e.g., Wikipedia, Open Directory Project (ODP) (Hu et al. 2009).

To realize the meanings of queries, several approaches have adopted taxonomies to classify queries into predefined search intent categories of different granulations. Broder (2002) divided query intent into navigational, informational, and transactional types. Nguyen and Kan (2007) characterized queries along four general facets of ambiguity, authority, temporal sensitivity, and spatial sensitivity. Manshadi and Li (2009) constructed a hybrid, generative grammar model based on probabilistic context-free rules for classifying queries into finer categories. Geng et al. (2008) applied the k-nearest neighbor (k-NN) classification algorithm to assign search intent categories of queries. Given an unseen query, a classifier generated by the k-NN classification algorithm was used to identify which training queries were similar to the unseen query before assigning the search intent of the most similar query to the unseen query. Hu et al. (2009) designed a random walk method using Wikipedia to predict query search intent. Understanding the search intent is helpful to improve search effectiveness. Radlinski and Joachims (2005) mined search intent from query chains and applied it to learning to rank algorithms. Boldi et al. (2008) employed the query-flow graphs to predict the search intent of queries for query recommendation.

2.2 Search result diversification

Diversifying search results has been studied and applied to different applications (Zhai et al. 2003; Radlinski et al. 2009; Santos et al. 2010b). Generally speaking, the previous works of search result diversification can be categorized into implicit or explicit approaches (Santos et al. 2010a). The implicit approaches assume the related documents will contain similar subtopics and can be regarded as redundant information. These similar documents might be demoted in the ranking list for diversification. The explicit approaches, in turn, directly model the subtopics of queries, and search the retrieved documents to maximize the coverage of the subtopics. The main difference between the two approaches is the implicit approaches identify the redundant information between a new document and the previous selected documents without considering the subtopics of a query explicitly.

For the implicit approaches, Carbonell and Goldstein (1998) proposed the maximal marginal relevance (MMR) to rank a retrieved document under a combination of a relevancy score with respect to a query and a dissimilarity score with respect to other similar documents selected at earlier ranks. Zhai et al. (2003) modeled relevance and redundancy based on the KL-divergence measure and a simple mixture model. Yue and Joachims (2008) maximized the word coverage to select the optimum set of diversified documents. The learned model selected documents for covering maximum distinct words with the greedy search. Chen and Karger (2006) presented a selection algorithm based on the Bayesian information retrieval framework for diversifying search results among the top ten previously visited results. Their selection algorithm estimated the documents based on the probability ranking principle and used pseudo-relevance feedback to search result diversification by negative feedback on the redundant documents. Vee et al. (2008) proposed several B+ tree based diversifying models to return a set of different answers on query answering diversification. Gollapudi and Sharma (2009) developed a set of natural axioms for diversification and utilized several diversified functions in their diversification framework. Wang and Zhu (2009) used mean–variance analysis to search result diversification based on the economic portfolio theory. Their algorithm estimated an uncertainty in terms of the “risk” in economic domain trade-off between the expected relevance of a set of retrieved documents and the correlation between them, and it selected the right combination of relevant documents under the uncertainty estimation.

Different from the implicit approaches, subtopics can be modeled explicitly by queries for diversifying search results. Subtopics may be mined from a predefined taxonomy (Agrawal et al. 2009), related sub-queries, and suggested sub-queries (Santos et al. 2010a), etc. Radlinski and Dumais (2006) used the query–query reformulation records in search query logs to discover subtopics, and they diversified the retrieved results for improving effectiveness on personalized search. Agrawal et al. (2009) presented a systematic approach to diversifying search results through modeling subtopic from queries and documents by classification taxonomy for minimizing the risk of dissatisfaction of the average users. Carterette and Chandar (2009) identified subtopics by topic models and used a probabilistic ranking model to maximize the subtopic coverage rate for search result diversification. Santos et al. (2010a, b) uncovered subtopics by the query suggestion from search engines and proposed a probabilistic framework that estimated the diversity based on not only the documents’ relevance to query subtopics, but also the relative importance associated with the query subtopics. Rafiei et al. (2010) identified subtopics by existing taxonomic information and regarded the problem of diversifying results as expectation maximization. They attempted to broaden the coverage of subtopics in the retrieved results. Welch et al. (2011) employed WordNet and Wikipedia to discover subtopics and presented a diversification algorithm especially suitable for informational queries where users may take more than one page to satisfy their needs.

Compared to the studies described above, the major contributions of this work are four-fold.

Exploring various subtopic mining methods: A total of six subtopic mining methods are explored in this paper. These methods mine subtopics from different aspects, such as retrieved documents and external resources.
Analyzing the effectiveness of the subtopic mining methods in depth: We analyze the effectiveness of the subtopics derived from different subtopic mining methods. A user study comparing the subtopics mined by different mining methods with the ground truth in the TREC09 and TREC10 is conducted.
A novel subtopic-based diversified algorithm: The proposed diversification algorithm optimizes the estimation from three aspects, the richness of subtopics, the importance of subtopics, and the novelty of subtopics for diversifying search results.
Thorough evaluation of the document ranking experiments: Through experiments on the ClueWeb09 Category A and Category B test collections with the topics of the TREC09 and TREC10, we demonstrate that our model significantly improves the performance of the state-of-the-art models proposed in the TREC09 and TREC10 Web Track diversity tasks.

3 A diversified retrieval system

Figure 1 shows our subtopic-based diversified retrieval framework, which contains two phases, subtopic mining and document ranking. The ClueWeb09 dataset is indexed by an information retrieval model, and the model reports a set of relevant documents after query submissions. In this study, the number of returned documents is set to one thousand. Several resources are employed to mine the subtopics of a query. The indirect mining methods mine subtopics from the retrieved documents for the given queries, and the direct mining methods mine subtopics from the queries themselves. Three indirect subtopic mining methods, including a clustering-based method, a topic-category-based method, and a concept-tag-based method, are proposed. In addition, three direct subtopic mining methods, including an ontology-based method, a query-logs-based method, and a related-search-based method, are also introduced. The six mining methods are explored in the following sections. After the subtopic mining, we propose two document ranking algorithms based on the mined subtopics for search result diversification.

In the following sections, we begin by introducing how to mine the subtopics of a given query by indirect and direct subtopic mining methods. Then, how to merge and re-rank the retrieved results based on the mined subtopics for search result diversification is described.

3.1 Subtopic mining

In this section, we describe the subtopic mining methods that mine subtopics of a query for supporting the subtopic-based diversified retrieval system. The query “dinosaur”, which was selected from the TREC09 topic set, is taken as an example to illustrate the characteristics of each subtopic mining method.

3.1.1 Clustering-based method

Clustering search results is one of the possible ways to mine subtopics of a given query. Two documents are similar if their representations are similar. Documents of similar representations are grouped together to form a document cluster. The common representation of a document cluster is identified as a subtopic of the query. Given a query, the relevant documents are retrieved, features are extracted from each retrieved document, and the weight of a feature is determined by tf-idf as follows:

$$ w_{i,d} = (0.5 + \frac{{0.5freq_{i,d} }}{{\max_{d} freq}}) \times log\frac{N}{{n_{i} }}, $$

(1)

where freq _i,d is the frequency of feature i in document d, max_d freq is the maximum feature frequency in document d, N is the total number of retrieved documents, and n _i is the number of documents in which feature i appears.

The k-means clustering algorithm (MacQueen 1967) is performed on the retrieved documents, and the documents of similar representations are put together in a document cluster. In this method, the cosine distance determines the similarity between two documents. The number of clusters (e.g., k) has to be determined before the k-means clustering algorithm is applied. The number of clusters is regarded as the number of subtopics within the given query. Intuitively, the number of subtopics depends on the query itself. For example, the query “Obama family tree” may have three subtopics like “TIME magazine photo essay”, “Barack Obama’s parents and grandparents come from”, and “biographical information on Barack Obama”. Another query “kcs” may have five subtopics, such as “Kansas City Southern railroad homepage”, “job information with the Kansas City Southern railroad”, “Kanawha County Schools in West Virginia homepage”, “Knox County School system” and “KCS Energy”. Therefore, finding an appropriate k is an important issue.

This paper considers three strategies to determine the k. These strategies are based on empirical observation and two external resources, i.e., “Google Insight for Search^{Footnote 1}” (GIS) and “Open Directory Project^{Footnote 2}” (ODP). For the empirical observation, we set k to 5, 10, and 20 because our observation shows the number of subtopics of a query is unlikely to be more than twenty. Furthermore, we employ external resources to determine the k. The GIS provides information of a query based on users’ Internet search patterns on Google, such as search volume patterns across specific regions, geographic distribution, and categories for a query. The GIS classifies queries on the Web into 27 categories. We consult the GIS to collect all possible categories of a given query. The number of categories is regarded as the number of its subtopics. The ODP is constructed and maintained by a vast, global community of volunteer editors. It contains more than four million web pages that are organized into more than 500 thousand categories. All websites in the ODP are classified into sixteen major categories. We submit queries to the ODP to collect the major categories of the returned web pages. The number of distinct major categories is regarded as the number of subtopics.

Table 1 lists the collected categories based on the two external resources for the query “dinosaur”. As shown in the table, the query “dinosaur” is classified into six major categories by the GIS and eleven major categories by the ODP. This reflects the fact that the ODP provides more specific categories than the GIS.

Table 1 Categories mined by using the GIS and the ODP for the query “dinosaur”

Mining subtopics from different aspects for diversifying search results

Abstract

Similar content being viewed by others

Identifying interdisciplinary topics and their evolution based on BERTopic

Leveraging Semantic Search and LLMs for Domain-Adaptive Information Retrieval

Meta-path automatically extracted from heterogeneous information network for recommendation

1 Introduction

2 Related works

2.1 Query understanding

2.2 Search result diversification

3 A diversified retrieval system

3.1 Subtopic mining

3.1.1 Clustering-based method

3.1.2 Topic-category-based method

3.1.3 Concept-tag-based method

3.1.4 Ontology-based method

3.1.5 Query-logs-based method

3.1.6 Related-search-based method

3.2 Document ranking

3.2.1 Round-robin for diversification

3.2.2 Subtopics for diversification

4 Experimental resources

5 Experimental results and discussion

5.1 Evaluation of the subtopic mining methods

5.2 Baseline models for document ranking

5.3 Parameter determination

5.4 Evaluation on the ClueWeb09 Category B collection

5.5 Evaluation on the ClueWeb09 Category A collection

6 Conclusion and future work

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation