Information Retrieval, Volume 16, Issue 4, pp 452–483

Mining subtopics from different aspects for diversifying search results

  • Chieh-Jen Wang
  • Yung-Wei Lin
  • Ming-Feng Tsai
  • Hsin-Hsi Chen
Search Intents and Diversification

Abstract

User queries to the Web often have more than one interpretation because of ambiguity and other characteristics. How to diversify ranked results to meet users’ various potential information needs has attracted considerable attention recently. This paper aims to mine the subtopics of a query, either indirectly from the results returned by retrieval systems or directly from the query itself, in order to diversify the search results. For the indirect subtopic mining approach, we investigate clustering the retrieval results and summarizing the content of the clusters. In addition, we explore labeling each returned document with topic categories and concept tags. For the direct subtopic mining approach, several external resources are consulted, such as Wikipedia, the Open Directory Project, search query logs, and the related search services of search engines. Furthermore, we propose a diversified retrieval model that ranks documents with respect to the mined subtopics to balance relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in both tasks. The best performance of the proposed algorithm is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on TREC09, and α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on TREC10. The results indicate that subtopic mining with up-to-date user search query logs is the most effective way to generate the subtopics of a query, and that the proposed subtopic-based diversification algorithm can select documents covering various subtopics.
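
The abstract reports results in α-nDCG, a metric that rewards covering new subtopics and discounts repeated coverage of a subtopic already seen higher in the ranking. As a rough illustration only, the sketch below computes α-nDCG@k following the commonly cited definition from Clarke et al. (2008). It is not the evaluation code used in the paper. The function names, the toy judgments, and the default α = 0.5 are assumptions made for this example, and the ideal ranking is approximated greedily because computing it exactly is NP-hard.

```python
import math
from typing import Dict, List, Set


def novelty_gain(subtopics: Set[str], seen: Dict[str, int], alpha: float) -> float:
    """Gain of one document: each subtopic it covers counts (1 - alpha)^r,
    where r is how many earlier documents already covered that subtopic."""
    return sum((1.0 - alpha) ** seen.get(s, 0) for s in subtopics)


def alpha_dcg(ranking: List[str], judgments: Dict[str, Set[str]],
              alpha: float = 0.5, k: int = 10) -> float:
    """alpha-DCG@k with the usual log2(rank + 1) position discount."""
    seen: Dict[str, int] = {}  # subtopic -> number of times covered so far
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        subtopics = judgments.get(doc, set())
        score += novelty_gain(subtopics, seen, alpha) / math.log2(rank + 1)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return score


def greedy_ideal_ranking(judgments: Dict[str, Set[str]],
                         alpha: float = 0.5, k: int = 10) -> List[str]:
    """Greedy approximation of the ideal ordering: repeatedly pick the
    judged document with the highest remaining novelty gain."""
    remaining = set(judgments)
    seen: Dict[str, int] = {}
    ideal: List[str] = []
    while remaining and len(ideal) < k:
        # sorted() makes tie-breaking deterministic (alphabetical by doc id)
        best = max(sorted(remaining),
                   key=lambda d: novelty_gain(judgments[d], seen, alpha))
        ideal.append(best)
        remaining.remove(best)
        for s in judgments[best]:
            seen[s] = seen.get(s, 0) + 1
    return ideal


def alpha_ndcg(ranking: List[str], judgments: Dict[str, Set[str]],
               alpha: float = 0.5, k: int = 10) -> float:
    """alpha-DCG@k of the ranking divided by that of the greedy ideal ranking."""
    ideal = greedy_ideal_ranking(judgments, alpha, k)
    ideal_score = alpha_dcg(ideal, judgments, alpha, k)
    return alpha_dcg(ranking, judgments, alpha, k) / ideal_score if ideal_score > 0 else 0.0


# Hypothetical judgments: document id -> subtopics it is relevant to.
toy_judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}, "d4": set()}
redundant = ["d1", "d2", "d3"]  # covers subtopic "a" twice before reaching "b"
diverse = ["d1", "d3", "d2"]    # covers both subtopics in the top two ranks
```

On the toy data, `alpha_ndcg(diverse, toy_judgments, k=3)` scores higher than `alpha_ndcg(redundant, toy_judgments, k=3)`, because the second document about subtopic "a" earns only half the gain once "a" has already been covered.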

Keywords

Diversified retrieval · Subtopic mining · Search result re-ranking

Acknowledgments

This work was partially supported by the National Science Council (Taiwan) and the Excellent Research Projects of National Taiwan University under contracts NSC98-2221-E-002-175-MY3, NSC99-2221-E-002-167-MY3, and 101R890858.

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Chieh-Jen Wang (1)
  • Yung-Wei Lin (1)
  • Ming-Feng Tsai (2)
  • Hsin-Hsi Chen (1, corresponding author)

  1. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
  2. Department of Computer Science and Program in Digital Content and Technologies, National Chengchi University, Taipei, Taiwan
