Comparison of New Simple Weighting Functions for Web Documents against Existing Methods

  • Byurhan Hyusein
  • Ahmed Patel
  • Ferad Zyulkyarov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2869)

Abstract

Term weighting is one of the most important aspects of modern Web retrieval systems. The weight associated with a given term in a document shows the importance of the term for the document, i.e. its usefulness for distinguishing documents in a document collection. In search engines operating in a dynamic environment such as the Internet, where many documents are deleted from and added to the database, the usual formula involving the inverse document frequency is too costly to be computed each time the document collection is updated. This paper proposes two new simple and effective weighting functions. These weighting functions have been tested and compared with results obtained for the PIVOT, SMART and INQUERY methods using the WT10g collection of documents.
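The cost argument above can be illustrated with a minimal sketch. It assumes the textbook tf·idf form with idf = log(N / df); it is not the weighting proposed in this paper, nor the exact PIVOT, SMART or INQUERY formula. The function names and the toy collection are hypothetical. Because idf depends on the collection size N and the document frequency df, adding a single document changes the weight of a term in a document that was not itself modified.

```python
import math
from collections import Counter


def idf(term, docs):
    """Textbook inverse document frequency, log(N / df); an assumed form."""
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log(n_docs / df) if df else 0.0


def tf_idf(term, doc, docs):
    """Raw term frequency in `doc` multiplied by the collection-wide idf."""
    return Counter(doc)[term] * idf(term, docs)


# Hypothetical toy collection of tokenised documents.
docs = [
    ["web", "search", "engine"],
    ["web", "retrieval"],
    ["term", "weighting"],
]

first_doc = docs[0]
before = tf_idf("web", first_doc, docs)   # idf uses N = 3, df = 2

docs.append(["web", "crawler"])           # the collection is updated
after = tf_idf("web", first_doc, docs)    # idf now uses N = 4, df = 3

print(before, after)  # the weight of an untouched document has changed
```

In a collection that changes continuously, every such update can invalidate stored weights for all documents containing the affected terms, which is the recomputation cost the abstract refers to.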

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Byurhan Hyusein (1)
  • Ahmed Patel (1)
  • Ferad Zyulkyarov (1)
  1. Computer Networks and Distributed Systems Research Group, Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
