Latent entity space: a novel retrieval approach for entity-bearing queries

Abstract

Analysis of Web search query logs has revealed that a large portion of queries are entity-bearing, reflecting users' increasing demand for relevant information about entities such as persons, organizations, and products. Meanwhile, significant progress has been made in Web-scale information extraction, which enables efficient entity extraction from free text. Since an entity is expected to capture the semantic content of documents and queries more accurately than a term, it is worth studying whether leveraging information about entities can improve retrieval accuracy for entity-bearing queries. In this paper, we propose a novel retrieval approach, latent entity space (LES), which models relevance by leveraging entity profiles to represent the semantic content of documents and queries. In the LES, each entity corresponds to one dimension, representing one semantic relevance aspect. We propose a formal probabilistic framework to model relevance in the high-dimensional entity space. Experimental results over TREC collections show that the proposed LES approach is effective in capturing latent semantic content and can significantly improve the search accuracy of several state-of-the-art retrieval models for entity-bearing queries.
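To make the idea of "one dimension per entity" concrete, the following is a minimal toy sketch and not the estimation procedure from the paper. It assumes that each entity profile is a short bag of words, that a query or document is projected onto each entity dimension through a smoothed unigram likelihood, and that relevance is the inner product of the two projections. The entity names, profile texts, smoothing weight, and every function name here are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def unigram_lm(text):
    """Maximum-likelihood unigram model over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def log_likelihood(text, profile_lm, background, lam=0.5, floor=1e-6):
    """log P(text | entity profile) with Jelinek-Mercer smoothing (assumed choice)."""
    score = 0.0
    for term in text.lower().split():
        p = (1 - lam) * profile_lm.get(term, 0.0) + lam * background.get(term, floor)
        score += math.log(p)
    return score


def project(text, profiles, background):
    """Map text to normalized weights, one per entity dimension."""
    logs = {e: log_likelihood(text, lm, background) for e, lm in profiles.items()}
    top = max(logs.values())  # subtract the max before exponentiating to avoid underflow
    weights = {e: math.exp(v - top) for e, v in logs.items()}
    norm = sum(weights.values())
    return {e: w / norm for e, w in weights.items()}


def les_score(query, doc, profiles, background):
    """Toy relevance: inner product of query and document entity vectors."""
    q_vec = project(query, profiles, background)
    d_vec = project(doc, profiles, background)
    return sum(q_vec[e] * d_vec[e] for e in profiles)


# Hypothetical entity profiles (illustrative text, not taken from any knowledge base).
profile_texts = {
    "newspaper_entity": "denver daily newspaper colorado news",
    "software_entity": "programming language code software",
}
profiles = {e: unigram_lm(t) for e, t in profile_texts.items()}
background = unigram_lm(" ".join(profile_texts.values()))

query = "denver newspaper"
doc_a = "newspaper in denver colorado"
doc_b = "software code language"
print(les_score(query, doc_a, profiles, background))
print(les_score(query, doc_b, profiles, background))
```

In this sketch the first document scores higher than the second because it shares its dominant entity dimension with the query, even though the comparison is made in entity space and not by direct term overlap. A real system would draw profiles from a knowledge base or annotated corpus and would need a principled estimate of each projection.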

Notes

  1. A media group which owned Rocky Mountain News.

  2. A daily newspaper which is the rival of Rocky Mountain News in Denver, Colorado.

  3. We also tried other linear regression methods, and they could not deliver better performance.

  4. http://lemurproject.org/clueweb09/related-data.php.

  5. We use the Freebase entity annotations directly as weighted key concepts in the query.

  6. http://lintool.github.io/Ivory/.

  7. Actually they do not contain results for query #95 and #100, as there are no official judgments from TREC. Therefore, it is essentially the same as 200 queries in our case.

  8. http://ciir.cs.umass.edu/downloads/eqfe/.

  9. http://lemurproject.org/clueweb12/FACC1/.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number IIS-1423002. We thank the anonymous reviewers for their useful comments.

Author information

Corresponding author

Correspondence to Xitong Liu.

About this article

Cite this article

Liu, X., Fang, H. Latent entity space: a novel retrieval approach for entity-bearing queries. Inf Retrieval J 18, 473–503 (2015). https://doi.org/10.1007/s10791-015-9267-x

Keywords

  • Latent entity space
  • Entity profile
  • Document retrieval