Table 1 Main corpora and processing tools for each language

From: Corpus-based vocabulary lists for language learners for nine languages

| Language* | Name | Size (million tokens) | Processing tools |
|---|---|---|---|
| Arabic | Internet-AR | 174 | Sawalha and Atwell (2010) |
| Chinese | Internet-ZH | 277 | From Northeastern University, China |
| English | UKWaC | 1,526 | TreeTagger |
| Greek | GkWaC | 149 | ILSP tools |
| Italian | ItWaC | 1,910 | TreeTagger |
| Norwegian | NoWaC | 700 | Oslo–Bergen tagger |
| Polish | Polish web corpus | 128 | TaKIPI, Piasecki (2007) |
| Russian | Internet-RU | 188 | Sharoff et al. (2008) |
| Swedish | SwedishWaC | 114 | Kokkinakis and Johansson Kokkinakis (1997) |
* The Arabic corpus was, as far as possible, Modern Standard Arabic only.