
Coverage-based query subtopic diversification leveraging semantic relevance

Abstract

Users are generally terse when describing their search intent in the queries they submit to a search engine. As a result, many search queries are short, ambiguous, and open to multiple interpretations. Given the enormous size of the web, ignoring the information needs underlying such queries can mislead the search engine. An effective way to mitigate this problem is to diversify the search results according to the query subtopics, which reflect diverse intents. The task of identifying the possible subtopics underlying a query is known as subtopic mining. This paper aims at mining and diversifying the subtopics underlying a query. Our method first extracts noun phrases containing the query terms from the top-retrieved web documents. We also extract query suggestions and completions from commercial search engines. The extracted candidates that are highly related to the query are then selected as subtopics. We introduce a new relatedness score function to estimate the degree of relatedness between the query and a candidate. To estimate the relevance between the query and a subtopic, we introduce a semantic relevance measure based on a locally trained sentence embedding model. Finally, we propose a novel coverage-based diversification technique that ranks the subtopics by combining their relevance with their coverage, estimated from the web documents. Experimental results on two NTCIR English subtopic mining datasets demonstrate that the proposed method achieves new state-of-the-art performance and significantly outperforms several known related methods in terms of the relevance (D-nDCG) and diversity (D#-nDCG) metrics at a cutoff of 10.
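The abstract does not give the ranking formula, so the following is only a minimal sketch of how relevance and coverage could be combined in a greedy ranking, not the authors' method. It assumes that relevance is the cosine similarity between a query embedding and a subtopic embedding, that coverage is the fraction of not-yet-covered documents a subtopic matches, and that the two are mixed linearly. The names `cosine`, `diversify`, and the weight `lam` are hypothetical, and the toy vectors stand in for the output of a sentence embedding model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversify(query_vec, candidates, k=10, lam=0.5):
    """Greedily rank subtopics by relevance and marginal document coverage.

    candidates maps a subtopic string to (embedding, set of matching doc ids).
    lam is an assumed mixing weight: 1.0 ranks by relevance only.
    """
    relevance = {s: cosine(query_vec, vec) for s, (vec, _) in candidates.items()}
    all_docs = set().union(*(docs for _, docs in candidates.values()))
    covered, ranked, remaining = set(), [], set(candidates)

    while remaining and len(ranked) < k:
        def score(s):
            # Marginal coverage: share of all documents that s would newly cover.
            new_docs = candidates[s][1] - covered
            gain = len(new_docs) / len(all_docs) if all_docs else 0.0
            return lam * relevance[s] + (1 - lam) * gain

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        covered |= candidates[best][1]
        remaining.remove(best)
    return ranked


# Toy data: two-dimensional vectors stand in for sentence embeddings.
query_vec = [1.0, 0.0]
candidates = {
    "jaguar car": ([1.0, 0.0], {1, 2, 3}),
    "jaguar car price": ([0.9, 0.1], {1, 2}),
    "jaguar animal": ([0.6, 0.8], {4, 5, 6}),
}
print(diversify(query_vec, candidates))
print(diversify(query_vec, candidates, lam=1.0))
```

With the toy data, the mixed score places "jaguar animal" ahead of "jaguar car price", because the latter matches only documents already covered by "jaguar car", while ranking by relevance alone (`lam=1.0`) keeps the two car subtopics together at the top.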



Notes

  1. http://lemurproject.org/clueweb12/.

  2. https://www.lemurproject.org/indri/.

  3. http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/.

  4. http://opennlp.apache.org/.

  5. http://www.lemurproject.org/stopwords/stoplist.dft.

  6. https://radimrehurek.com/gensim/models/doc2vec.html.


Acknowledgements

This work has been conducted at Toyohashi University of Technology, Japan, and supported by KAKENHI, Grants-in-Aid for Scientific Research (B), 17H01746.

Author information

Corresponding author

Correspondence to Md. Shajalal.

About this article


Cite this article

Shajalal, M., Aono, M. Coverage-based query subtopic diversification leveraging semantic relevance. Knowl Inf Syst 62, 2873–2891 (2020). https://doi.org/10.1007/s10115-020-01470-3

  • DOI: https://doi.org/10.1007/s10115-020-01470-3

Keywords

  • Subtopic mining
  • Relatedness
  • Sentence embedding
  • Coverage-based diversification