EuroGOV: Engineering a Multilingual Web Corpus

  • Börkur Sigurbjörnsson
  • Jaap Kamps
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4022)


EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian governmental web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.







Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Börkur Sigurbjörnsson (1)
  • Jaap Kamps (1, 2)
  • Maarten de Rijke (1)
  1. ISLA, Faculty of Science, University of Amsterdam
  2. Archives and Information Science, Faculty of Humanities, University of Amsterdam
