Segmentation of Web Documents and Retrieval of Useful Passages

  • Carlos G. Figuerola
  • José L. Alonso Berrocal
  • Angel F. Zazo Rodríguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5152)


This year’s WebCLEF task was to retrieve snippets, that is, passages of documents, on various topics. Snippets can be extracted, and the most useful ones selected, in several ways. This paper describes the segmentation process used and how snippets are selected from the segments it produces. It also describes the tests carried out and their results.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Carlos G. Figuerola (1)
  • José L. Alonso Berrocal (1)
  • Angel F. Zazo Rodríguez (1)
  1. REINA Research Group, University of Salamanca, Salamanca, Spain. Email: reina@usal.es
