Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval

  • Felipe Bravo-Marquez
  • Gaston L’Huillier
  • Sebastián A. Ríos
  • Juan D. Velásquez
Conference paper

DOI: 10.1007/978-3-642-16321-0_32

Volume 6393 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Bravo-Marquez F., L’Huillier G., Ríos S.A., Velásquez J.D. (2010) Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval. In: Chavez E., Lonardi S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg

Abstract

The retrieval of similar documents on the Web from a given document differs in many respects from information retrieval based on queries generated by regular search engine users. In this work, a new method is proposed for Web document similarity retrieval based on generative language models and meta search engines. Probabilistic language models are used as a random query generator for the given document. The queries are submitted to a customizable set of Web search engines. Once all results are gathered, they are evaluated by a proposed scoring function based on Zipf's law. The results show that the proposed methodology for query generation and scoring solves the problem with acceptable levels of precision.
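The pipeline described in the abstract (random query generation from the document, then scoring of the gathered results) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the function names `generate_query` and `zipf_score`, the reading of the "hypergeometric" model as drawing terms without replacement from the document's bag of tokens, and the rank weight 1 / r^s with exponent `s`. The search engine calls are omitted; ranked result lists are passed in directly.

```python
import random
from collections import defaultdict


def generate_query(tokens, query_len, rng):
    """Draw query_len tokens without replacement from the document's tokens.

    Assumption: drawing without replacement from a bag of tokens gives
    per-term counts that follow a multivariate hypergeometric distribution.
    This is an illustrative reading, not the paper's stated model.
    """
    k = min(query_len, len(tokens))
    return rng.sample(tokens, k)


def zipf_score(ranked_lists, s=1.0):
    """Aggregate ranked result lists with a Zipf-like weight.

    Assumption: a result at rank r in any list contributes 1 / r**s,
    and contributions are summed across all queries and engines.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank ** s
    # Highest aggregated score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical usage: one generated query, two made-up ranked result lists.
doc = "the retrieval of similar documents in the web".split()
rng = random.Random(0)
query = generate_query(doc, 3, rng)
ranked = zipf_score([["a", "b"], ["b", "c"]])
```

In this toy input, result "b" appears in both lists (ranks 2 and 1), so it accumulates 0.5 + 1.0 and is ranked first. A document returned by many queries at high ranks rises to the top, which is the intuition behind rank-based aggregation.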

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Felipe Bravo-Marquez (1)
  • Gaston L’Huillier (1)
  • Sebastián A. Ríos (1)
  • Juan D. Velásquez (1)

  1. Department of Industrial Engineering, University of Chile, Santiago, Chile