Improving Web Pages Retrieval Using Combined Fields

  • Carlos G. Figuerola
  • José L. Alonso Berrocal
  • Ángel F. Zazo Rodríguez
  • Emilio Rodríguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4730)

Abstract

This article describes the participation of the REINA Research Group of the University of Salamanca in WebCLEF 2006, where we took part in the Monolingual Mixed Task in Spanish. The entire EuroGOV collection was processed to select all the pages in Spanish; all the pages in the .es domain were also pre-selected. Our objective this year was to test pre-retrieval techniques for combining information fields or elements from web pages, and to assess the retrieval capability of each of these fields. In vector-based retrieval systems, terms coming from different sources can be combined by operating on the frequency of the terms in the document under a tf × idf weighting scheme. The BODY field is, as expected, the most useful from the retrieval perspective, but the text of the backlinks brings considerable improvement. META fields or tags, however, contribute little to retrieval improvement.
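
As an illustration of combining fields by operating on term frequencies, the following Python sketch merges per-field term counts into a single weighted frequency before applying tf × idf. It is not the authors' implementation: the field multipliers in FIELD_WEIGHTS, the whitespace tokenizer, the logarithmic idf variant, and all function names are assumptions made for the example, since the abstract gives no concrete values.

```python
import math
from collections import Counter

# Hypothetical per-field multipliers (assumption: the abstract gives no values).
FIELD_WEIGHTS = {"body": 1.0, "backlinks": 2.0, "meta": 0.5}


def combined_tf(fields, weights=FIELD_WEIGHTS):
    """Merge the term counts of several fields into one weighted tf map.

    `fields` maps a field name to its text. Each occurrence of a term
    adds the weight of the field it appears in (1.0 for unknown fields).
    """
    tf = Counter()
    for name, text in fields.items():
        weight = weights.get(name, 1.0)
        # Naive lowercase whitespace tokenizer, used only for illustration.
        for term in text.lower().split():
            tf[term] += weight
    return dict(tf)


def idf(term, collection_tf):
    """Plain log(N / df) inverse document frequency over the collection."""
    n_docs = len(collection_tf)
    df = sum(1 for doc_tf in collection_tf if term in doc_tf)
    return math.log(n_docs / df) if df else 0.0


def tfidf_vector(doc_tf, collection_tf):
    """Weight each combined term frequency by its idf."""
    return {t: f * idf(t, collection_tf) for t, f in doc_tf.items()}


# Two toy documents, each described by several fields.
docs = [
    {"body": "ministerio de salud", "backlinks": "salud"},
    {"body": "ministerio de cultura", "meta": "cultura"},
]
collection = [combined_tf(d) for d in docs]
vectors = [tfidf_vector(tf, collection) for tf in collection]
```

In this sketch a term found in the backlink text counts twice as much as the same term in the body, so the choice of multipliers directly controls how much each field influences the final document vector.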

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Carlos G. Figuerola (1)
  • José L. Alonso Berrocal (1)
  • Ángel F. Zazo Rodríguez (1)
  • Emilio Rodríguez (1)
  1. REINA Research Group, University of Salamanca, Spain. Email: reina@usal.es
