EuroGOV: Engineering a Multilingual Web Corpus

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4022)

Abstract

EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sigurbjörnsson, B., Kamps, J., de Rijke, M. (2006). EuroGOV: Engineering a Multilingual Web Corpus. In: Peters, C., et al. Accessing Multilingual Information Repositories. CLEF 2005. Lecture Notes in Computer Science, vol 4022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11878773_90

  • DOI: https://doi.org/10.1007/11878773_90

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45697-1

  • Online ISBN: 978-3-540-45700-8

  • eBook Packages: Computer Science, Computer Science (R0)
