Part of Speech Based Term Weighting for Information Retrieval

  • Christina Lioma
  • Roi Blanco
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5478)

Abstract

Automatic language processing tools typically assign to terms so-called ‘weights’ that correspond to the contribution of each term to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (a language model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
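
As a rough illustration of the idea in the abstract, the sketch below derives a term weight from the POS n-grams in which a term occurs and multiplies it into a TF-IDF score. The abstract does not state the five computations, so everything here is an assumption: the IDF-style score for POS n-grams, the averaging over a term's contexts, the multiplicative integration, and all function names (`pos_ngrams`, `pos_term_weights`, `weighted_tf_idf`) are hypothetical and should not be read as the authors' formulation.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tagged, n):
    """Yield (start_index, POS n-gram) for every window of n tokens."""
    tags = [tag for _, tag in tagged]
    for i in range(len(tags) - n + 1):
        yield i, tuple(tags[i:i + n])


def pos_term_weights(tagged_sentences, n=2):
    """Hypothetical POS-based weight: mean score of a term's POS contexts."""
    # Count the number of sentences in which each POS n-gram appears.
    sent_freq = Counter()
    for sent in tagged_sentences:
        sent_freq.update({gram for _, gram in pos_ngrams(sent, n)})
    total = len(tagged_sentences)
    # Assumption: rarer POS n-grams are more informative (IDF-like score).
    gram_score = {g: math.log(1 + total / sf) for g, sf in sent_freq.items()}

    # Collect the score of every POS n-gram window that covers a term.
    contexts = defaultdict(list)
    for sent in tagged_sentences:
        for i, gram in pos_ngrams(sent, n):
            for term, _ in sent[i:i + n]:
                contexts[term.lower()].append(gram_score[gram])
    return {term: sum(s) / len(s) for term, s in contexts.items()}


def weighted_tf_idf(query_terms, doc_tf, doc_freq, num_docs, pos_weight):
    """TF-IDF score with each query term scaled by its POS-based weight."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from the document or the collection
        idf = math.log(num_docs / df)
        # Terms without a POS-based weight fall back to a neutral 1.0.
        score += pos_weight.get(term, 1.0) * tf * idf
    return score


# Toy POS-tagged corpus (Penn Treebank style tags), for illustration only.
sentences = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "DT"), ("retrieval", "NN"), ("model", "NN"), ("ranks", "VBZ")],
]
weights = pos_term_weights(sentences, n=2)
score = weighted_tf_idf(
    ["retrieval", "the"],
    doc_tf={"retrieval": 2, "the": 5},
    doc_freq={"retrieval": 1, "the": 10},
    num_docs=10,
    pos_weight=weights,
)
```

In this toy corpus the determiner "the" only ever occurs in the most common POS bigram, so it receives the lowest weight, while "model" occurs only in rarer bigrams and receives a higher one. Because the weight is computed from POS statistics alone, it is independent of any particular query or document and could be precomputed once per term.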

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Christina Lioma (1)
  • Roi Blanco (2)

  1. Computer Science, Katholieke Universiteit Leuven, Belgium
  2. Computer Science, La Coruna University, Spain
