Abstract
Inferring query intent is important in information retrieval. Query subtopic mining aims to find the possible subtopics of a given query, each representing a potential intent. The task is challenging because queries are short. Methods for learning distributed representations of words and word sequences have developed rapidly in recent years and have had a great impact on many fields. However, it remains unclear whether distributed representations can alleviate the challenges of query subtopic mining. In this paper, we exploit and compare the main semantic composition strategies of distributed representations for query subtopic mining. Specifically, we focus on two types of distributed representations: the paragraph vector, which directly represents word sequences of arbitrary length, and word vector composition. We thoroughly investigate the impact of the semantic composition strategy and of the type of data used for learning distributed representations. Experiments were conducted on a public dataset provided by the National Institute of Informatics Testbeds and Community for Information Access Research (NTCIR). The empirical results show that distributed semantic representations achieve outstanding performance on query subtopic mining compared with traditional semantic representations. Further insights are also reported.
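To make the idea of word vector composition concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that a subtopic candidate is represented by summing or averaging the vectors of its words, and that candidates are compared by cosine similarity. The three-dimensional vectors, the vocabulary, and the function names `compose` and `cosine` are all hypothetical and chosen only for illustration. The paragraph vector alternative, which learns one vector per word sequence directly, is not shown.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# Real word vectors would be learned from a large corpus.
TOY_VECTORS = {
    "apple": [0.9, 0.1, 0.0],
    "iphone": [0.8, 0.2, 0.1],
    "price": [0.1, 0.9, 0.0],
    "pie": [0.0, 0.2, 0.9],
    "recipe": [0.1, 0.1, 0.8],
}


def compose(words, vectors=TOY_VECTORS, average=True):
    """Compose a phrase vector by summing (or averaging) its word vectors.

    Words missing from the vocabulary are skipped (an assumption of this sketch).
    """
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words in phrase")
    dim = len(known[0])
    total = [sum(v[i] for v in known) for i in range(dim)]
    if average:
        return [x / len(known) for x in total]
    return total


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Three hypothetical subtopic candidates for the ambiguous query "apple".
phone_a = compose(["apple", "iphone", "price"])
food = compose(["apple", "pie", "recipe"])
phone_b = compose(["iphone", "price"])

# Candidates that share an intent should be closer in the composed space.
print(cosine(phone_a, phone_b) > cosine(food, phone_b))
```

With composed vectors of this kind, candidate subtopics can be grouped by any standard clustering method over cosine similarity, so that each cluster stands for one potential intent.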
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61876113 and 61402304), the Beijing Educational Committee Science and Technology Development Plan of China (No. KM201610028015), and the Beijing Advanced Innovation Center for Imaging Technology of China.
Cite this article
Song, W., Liu, Y., Liu, Lz. et al. Semantic composition of distributed representations for query subtopic mining. Frontiers Inf Technol Electronic Eng 19, 1409–1419 (2018). https://doi.org/10.1631/FITEE.1601476
DOI: https://doi.org/10.1631/FITEE.1601476