On Identifying Phrases Using Collection Statistics

  • Simon Gog
  • Alistair Moffat
  • Matthias Petri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9022)


The use of phrases as part of similarity computations can enhance search effectiveness. But the gain comes at a cost, either in terms of index size, if all word-tuples are treated as queryable objects; or in terms of processing time, if postings lists for phrases are constructed at query time. There is also a lack of clarity as to which phrases are “interesting”, in the sense of capturing useful information. Here we explore several techniques for recognizing phrases using statistics of large-scale collections, and evaluate their quality.
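
As a rough illustration of what scoring candidate phrases from collection statistics can look like, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). PMI is used here only as one standard association measure; the abstract does not say which techniques the paper evaluates, so treat the choice of measure, the function name `bigram_pmi`, the `min_count` threshold, and the toy token list as assumptions rather than the authors' method.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration only. PMI(a, b) is
    log2(P(a, b) / (P(a) * P(b))), with probabilities estimated
    from raw counts over the token sequence.
    """
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)  # number of adjacent pairs
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (a, b), pair_count in bigram_counts.items():
        if pair_count < min_count:
            # Rare pairs receive inflated PMI values, so skip them.
            continue
        p_ab = pair_count / n_pairs
        p_a = unigram_counts[a] / n_tokens
        p_b = unigram_counts[b] / n_tokens
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy "collection": a real one would hold millions of tokens.
example_tokens = "new york is big new york is old".split()
example_scores = bigram_pmi(example_tokens)

# List candidate phrases from highest to lowest score.
for pair, score in sorted(example_scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(pair), round(score, 3))
```

A frequency threshold such as `min_count` is a common guard because PMI tends to overrate pairs seen only once or twice. At collection scale the counts would come from index statistics rather than an in-memory token list.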


Mutual Information; Query Time; Methyl Ether; Tertiary Butyl; Stop Word; Inverted Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Simon Gog (1, 2)
  • Alistair Moffat (1)
  • Matthias Petri (1)
  1. Department of Computing and Information Systems, The University of Melbourne, Australia
  2. Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany
