University of Glasgow at WebCLEF 2005: Experiments in Per-Field Normalisation and Language Specific Stemming

  • Conference paper
Accessing Multilingual Information Repositories (CLEF 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4022)

Abstract

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents from Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies, and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of both the language specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. The overall performance of our techniques is outstanding, achieving the overall top four performing runs, as well as the top performing run without metadata in the monolingual task. The best run only uses per-field normalisation, without applying stemming.
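The abstract names per-field normalisation but does not give its formula. The sketch below is therefore only an illustration, assuming one form that appears in the DFR literature (an extension of Normalisation 2): each field's term frequency is normalised by that field's length relative to the collection average, then the fields are combined as a weighted sum. The function name `per_field_tfn`, the field weights, and the `c` hyper-parameters are hypothetical and are not taken from the paper.

```python
import math


def per_field_tfn(tf, length, avg_length, weight, c):
    """Combine per-field term frequencies into one normalised frequency.

    Assumed form (not stated in the abstract):
        tfn = sum over fields f of
              weight[f] * tf[f] * log2(1 + c[f] * avg_length[f] / length[f])

    All arguments are dicts keyed by field name (e.g. content, title, anchor).
    """
    tfn = 0.0
    for field, freq in tf.items():
        # A field that is empty, or that lacks the term, contributes nothing.
        if freq == 0 or length[field] == 0:
            continue
        ratio = c[field] * avg_length[field] / length[field]
        tfn += weight[field] * freq * math.log2(1.0 + ratio)
    return tfn


# Hypothetical document: every non-empty field has exactly average length
# and c = 1, so each log2 term equals 1 and tfn is the weighted sum of tf.
tf = {"content": 3, "title": 1, "anchor": 0}
length = {"content": 100, "title": 5, "anchor": 0}
avg_length = {"content": 100.0, "title": 5.0, "anchor": 8.0}
weight = {"content": 1.0, "title": 2.0, "anchor": 3.0}
c = {"content": 1.0, "title": 1.0, "anchor": 1.0}

example_tfn = per_field_tfn(tf, length, avg_length, weight, c)
print(example_tfn)
```

In a DFR weighting model, a combined frequency of this kind would take the place of the single-field normalised term frequency; which model and parameter values the authors used is not stated in the abstract.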

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Macdonald, C., Plachouras, V., He, B., Lioma, C., Ounis, I. (2006). University of Glasgow at WebCLEF 2005: Experiments in Per-Field Normalisation and Language Specific Stemming. In: Peters, C., et al. Accessing Multilingual Information Repositories. CLEF 2005. Lecture Notes in Computer Science, vol 4022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11878773_100

  • DOI: https://doi.org/10.1007/11878773_100

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45697-1

  • Online ISBN: 978-3-540-45700-8

  • eBook Packages: Computer Science, Computer Science (R0)
