Semantic composition of distributed representations for query subtopic mining

  • Wei Song
  • Ying Liu
  • Li-zhen Liu (corresponding author)
  • Han-shi Wang


Inferring query intent is important in information retrieval tasks. Query subtopic mining aims to find possible subtopics for a given query to represent its potential intents. Subtopic mining is challenging because queries are typically short. Methods for learning distributed representations of words and word sequences have developed rapidly in recent years and have had a great impact on many fields. It remains unclear, however, whether distributed representations are effective in alleviating the challenges of query subtopic mining. In this paper, we exploit and compare the main semantic composition strategies for distributed representations in query subtopic mining. Specifically, we focus on two types of distributed representations: paragraph vectors, which directly represent word sequences of arbitrary length, and word vector composition. We thoroughly investigate the impact of the semantic composition strategy and of the type of data used for learning distributed representations. Experiments were conducted on a public dataset provided by the National Institute of Informatics Testbeds and Community for Information Access Research (NTCIR). The empirical results show that distributed semantic representations can achieve outstanding performance for query subtopic mining compared with traditional semantic representations. Further insights are also reported.
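
As a rough illustration of what word vector composition means here, the following minimal sketch builds a vector for a short query or candidate subtopic from the vectors of its words and compares two strings by cosine similarity. It is not the authors' implementation: the three-dimensional vectors in `TOY_VECTORS` are made up, the function names are hypothetical, and averaging and element-wise multiplication are simply two commonly used composition strategies, which may or may not match the ones evaluated in the paper.

```python
import math

# Hypothetical toy word vectors (assumption); real ones would be learned from a corpus.
TOY_VECTORS = {
    "apple": [0.9, 0.1, 0.3],
    "iphone": [0.8, 0.2, 0.1],
    "pie": [0.1, 0.9, 0.2],
    "recipe": [0.2, 0.8, 0.1],
}


def _known_vectors(words, vectors):
    """Look up the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    return known


def compose_average(words, vectors):
    """Additive composition: element-wise mean of the word vectors."""
    known = _known_vectors(words, vectors)
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def compose_multiply(words, vectors):
    """Multiplicative composition: element-wise product of the word vectors."""
    known = _known_vectors(words, vectors)
    result = [1.0] * len(known[0])
    for v in known:
        result = [a * b for a, b in zip(result, v)]
    return result


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    query = compose_average("apple iphone".split(), TOY_VECTORS)
    for candidate in ("iphone", "pie recipe"):
        vec = compose_average(candidate.split(), TOY_VECTORS)
        print(candidate, round(cosine(query, vec), 3))
```

With these toy vectors, the composed vector for "apple iphone" is closer to "iphone" than to "pie recipe", which is the kind of signal a subtopic miner could use to group candidate subtopics by intent. A paragraph vector model would instead learn one vector per word sequence directly, without an explicit composition step.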

Key words

Subtopic mining; Query intent; Distributed representation; Semantic composition




Copyright information

© Editorial Office of Journal of Zhejiang University Science and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Information and Engineering College, Capital Normal University, Beijing, China
