Exploiting entity relationship for query expansion in enterprise search

Abstract

Enterprise search is important because search quality has a direct impact on the productivity of an enterprise. Enterprise data contain both structured and unstructured information. Because these two types of information are complementary, and structured information such as relational databases is designed around ER (entity-relationship) models, enterprise data hold a rich body of information about entities. As a result, many information needs in enterprise search center on entities. For example, a user may formulate a query describing a problem that she encounters with an entity, e.g., the web browser, and want to retrieve relevant documents that help solve the problem. Intuitively, information related to the entities mentioned in the query, such as related entities and their relations, would be useful for reformulating the query and improving retrieval performance. However, most existing studies on query expansion are term-centric. In this paper, we propose a novel entity-centric query expansion framework for enterprise search. Specifically, given a query containing entities, we first use both unstructured and structured information to find entities related to those in the query. We then discuss how to adapt existing feedback methods to use the related entities and their relations to improve search quality. Experimental results on two real-world enterprise collections show that, for long natural-language-like queries with entities, the proposed entity-centric query expansion strategies improve search performance more effectively and robustly than state-of-the-art pseudo feedback methods. Moreover, results on a TREC ad hoc retrieval collection show that the proposed methods also work well for short keyword queries in the general search domain.
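The two steps summarized above (finding related entities, then folding them into the query) can be illustrated with a minimal sketch. It is not the framework proposed in the paper: the scoring rule (co-occurrence counts plus relation-table links), the linear interpolation with weight `lam`, the function names, and the toy data are all assumptions made here for illustration.

```python
from collections import Counter


def related_entities(query_entities, docs, relations, top_k=2):
    """Score candidate entities using two hypothetical evidence sources."""
    scores = Counter()
    # Unstructured evidence: entities mentioned in the same document as a query entity.
    for doc_entities in docs:
        if query_entities & doc_entities:
            for entity in doc_entities - query_entities:
                scores[entity] += 1.0
    # Structured evidence: (head, tail) pairs, e.g. rows of a relation table.
    for head, tail in relations:
        if head in query_entities and tail not in query_entities:
            scores[tail] += 1.0
        elif tail in query_entities and head not in query_entities:
            scores[head] += 1.0
    return scores.most_common(top_k)


def expand_query(query_terms, related, lam=0.3):
    """Interpolate a uniform query model with a normalized related-entity model."""
    q_model = {t: 1.0 / len(query_terms) for t in query_terms}
    if not related:
        return q_model  # nothing to expand with
    total = sum(score for _, score in related)
    expanded = {t: (1 - lam) * w for t, w in q_model.items()}
    for entity, score in related:
        expanded[entity] = expanded.get(entity, 0.0) + lam * score / total
    return expanded


# Toy data (invented): each document is reduced to the set of entities it mentions.
docs = [
    {"web browser", "proxy server"},
    {"web browser", "firewall", "proxy server"},
    {"printer", "driver"},
]
relations = [("web browser", "proxy server"), ("firewall", "vpn client")]

related = related_entities({"web browser"}, docs, relations)
expanded = expand_query(["web browser", "crashes"], related, lam=0.3)
```

In this sketch the original query terms keep a total weight of `1 - lam`, and the related entities share the remaining `lam` in proportion to their scores, so the expanded weights still sum to one.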

Notes

  1. http://crf.sf.net

  2. SecondString Java package: http://secondstring.sourceforge.net/

  3. The notations will be used throughout the remainder of the paper.

  4. Implementation provided by Ivory: http://lintool.github.io/Ivory/

  5. There are actually 9 queries that qualify for internal relation expansion in ENT1. However, since this query set is too small to construct a working set, we do not report the results.

Acknowledgments

This material is based upon work supported by the HP Labs Innovation Research Program. We thank the reviewers for their useful comments.

Author information

Corresponding author

Correspondence to Xitong Liu.

About this article

Cite this article

Liu, X., Chen, F., Fang, H. et al. Exploiting entity relationship for query expansion in enterprise search. Inf Retrieval 17, 265–294 (2014). https://doi.org/10.1007/s10791-013-9237-0

Keywords

  • Entity centric
  • Enterprise search
  • Retrieval
  • Query expansion
  • Combining structured and unstructured data