Abstract
This chapter focuses on the classic problem of ad hoc document retrieval and discusses how entities may be leveraged to improve retrieval performance. Entities facilitate a semantic understanding of both the user’s information need, as expressed by the keyword query, and the document’s content. We present three families of approaches: (1) expansion-based methods, which use entities as a source of expansion terms to enrich the representation of the query; (2) projection-based methods, which treat entities as a latent layer while leaving the original document and query representations intact; and (3) entity-based methods, which explicitly consider the entities recognized in documents and embrace entity-based representations in “duet” with traditional term-based representations.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2018 The Editor(s) (if applicable) and the Author(s)
About this chapter
Cite this chapter
Balog, K. (2018). Leveraging Entities in Document Retrieval. In: Entity-Oriented Search. The Information Retrieval Series, vol 39. Springer, Cham. https://doi.org/10.1007/978-3-319-93935-3_8
DOI: https://doi.org/10.1007/978-3-319-93935-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93933-9
Online ISBN: 978-3-319-93935-3
eBook Packages: Computer Science (R0)