Efficient Text Proximity Search

  • Ralf Schenkel
  • Andreas Broschart
  • Seungwon Hwang
  • Martin Theobald
  • Gerhard Weikum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4726)

Abstract

In addition to purely occurrence-based relevance models, term proximity has frequently been used to enhance the retrieval quality of keyword-oriented retrieval systems. While effective scoring functions that incorporate proximity have been proposed, there has been little work on algorithms or access methods for evaluating them efficiently. This paper presents an efficient evaluation framework that includes a proximity scoring function integrated within a top-k query engine for text retrieval. We propose precomputed and materialized index structures that boost performance. The increased retrieval effectiveness and efficiency of our framework are demonstrated through extensive experiments on a very large text benchmark collection. In combination with static index pruning for the proximity lists, our algorithm achieves an improvement of two orders of magnitude compared to a term-based top-k evaluation, with significantly improved result quality.
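
To make the idea of proximity scoring concrete, the following is a minimal sketch and not the scoring function used in the paper, which the abstract does not specify. It assumes a simple inverse-square weighting: every pair of occurrences of two distinct query terms contributes 1/d², where d is the distance between their token positions. The function name `proximity_score` and the weighting are illustrative assumptions.

```python
from itertools import combinations


def proximity_score(doc_tokens, query_terms):
    """Illustrative proximity score (assumed weighting, not the paper's).

    Sums 1 / d**2 over all pairs of occurrences of two *distinct*
    query terms, where d is the difference of their token positions.
    """
    wanted = set(query_terms)
    # Collect (position, term) for every query-term occurrence, in order.
    occurrences = [(pos, tok) for pos, tok in enumerate(doc_tokens) if tok in wanted]

    score = 0.0
    for (pos_a, term_a), (pos_b, term_b) in combinations(occurrences, 2):
        if term_a != term_b:
            # pos_b > pos_a because occurrences are in document order.
            score += 1.0 / (pos_b - pos_a) ** 2
    return score


if __name__ == "__main__":
    doc = "efficient text proximity search".split()
    print(proximity_score(doc, ["text", "search"]))  # positions 1 and 3, d = 2
```

A score of this kind would typically be added to an occurrence-based score. Precomputing such values per term pair is one way an index could avoid scanning positions at query time, though the index layout proposed in the paper is not described in the abstract.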

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Ralf Schenkel (1)
  • Andreas Broschart (1)
  • Seungwon Hwang (2)
  • Martin Theobald (3)
  • Gerhard Weikum (1)

  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
  2. POSTECH, Korea
  3. Stanford University
