Evolving local and global weighting schemes in information retrieval

Abstract

This paper describes a method that uses Genetic Programming to automatically determine term weighting schemes for the vector space model. Based on a set of queries and their human-determined relevant documents, weighting schemes are evolved that achieve a high average precision. In Information Retrieval (IR) systems, useful information for term weighting schemes is available from the query, individual documents and the collection as a whole.
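As a rough illustration of the fitness measure named above, average precision over ranked results can be computed as in the following sketch. The function names and the toy document identifiers are assumptions made for illustration; the abstract does not specify how the authors implemented their evaluation.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is retrieved.
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy query: relevant documents appear at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

In a Genetic Programming setting, a value of this kind averaged over the training queries could serve as the fitness of a candidate weighting scheme.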

We evolve term weighting schemes in both local (within-document) and global (collection-wide) domains that interact effectively with each other to achieve a high average precision. These weighting schemes are tested on well-known test collections and are compared to the traditional tf-idf weighting scheme and to the BM25 weighting scheme using standard IR performance metrics.
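The split between local and global weights can be made concrete with a minimal sketch of a textbook tf-idf scorer, in which a document's score for a query is the sum, over matching terms, of a local weight multiplied by a global weight. The raw term frequency and the `log(N / df)` form used here are one common variant and are assumptions; the abstract does not state the exact tf-idf baseline the authors used, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def local_weight(term_freq):
    """Local (within-document) weight: raw term frequency in this sketch."""
    return float(term_freq)


def global_weight(doc_freq, num_docs):
    """Global (collection-wide) weight: idf as log(N / df) in this sketch."""
    if doc_freq == 0:
        return 0.0
    return math.log(num_docs / doc_freq)


def document_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def score(query_terms, doc_terms, df, num_docs):
    """Sum of local * global weights over query terms present in the document."""
    tf = Counter(doc_terms)
    return sum(
        local_weight(tf[t]) * global_weight(df[t], num_docs)
        for t in set(query_terms)
        if tf[t] > 0
    )


# Hypothetical toy collection of tokenised documents.
docs = [
    ["genetic", "programming", "genetic"],
    ["term", "weighting"],
    ["genetic", "term"],
]
df = document_frequencies(docs)
scores = [score(["genetic", "weighting"], d, df, len(docs)) for d in docs]
```

An evolved scheme would replace the bodies of `local_weight` and `global_weight` with expressions discovered by the search, leaving the scoring loop unchanged.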

Furthermore, we show that the global weighting schemes evolved on small collections also increase average precision on larger TREC data. These global weighting schemes are shown to adhere to Luhn’s resolving power as both high and low frequency terms are assigned low weights. However, the local weightings evolved on small collections do not perform as well on large collections. We conclude that in order to evolve improved local (within-document) weighting schemes it is necessary to evolve these on large collections.

Author information

Correspondence to Ronan Cummins.

About this article

Cite this article

Cummins, R., O’Riordan, C. Evolving local and global weighting schemes in information retrieval. Inf Retrieval 9, 311–330 (2006). https://doi.org/10.1007/s10791-006-1682-6

Keywords

  • Genetic Programming
  • Information Retrieval
  • Term-Weighting Schemes