Table 13 Guidelines for inclusion of word types in KELLY lists

From: Corpus-based vocabulary lists for language learners for nine languages

Word type Policy Comments
Variants Spelling variants should be amalgamated, so that e.g. organize and organise are counted as one word for frequency calculations. Each language team will have to have a style guide for preferred forms for the list itself. For English, British and US spelling variants such as color/colour will also be amalgamated
Lexical variants*, e.g. cash machine/ATM, should be treated as separate items
 
Inflected forms These are not shown unless an inflected form has a meaning that is not inherent in the base form, e.g. better in the sense of ‘to get better’ Although learners may want to look up inflections, esp. irregular ones, for the purposes of frequency they should be treated together with the base form
Derivational inflected forms e.g. quickly, happiness To be treated as words in their own right, i.e. as separate lemgrams  
Affixes, including productive affixes No, an affix will only appear if it forms a word that is common enough in itself to merit inclusion  
Abbreviations Yes, including abbreviations that are written only, but only if they meet the normal criteria of what we are including, so not abbreviations for proper nouns and encyclopaedic items. The most common abbreviations will probably be forms of address, weights and measures, Latin abbreviations, and the few cases where an abbreviation is the normal way to refer to an item, e.g. DVD NB The inclusion of abbreviations will mean searching on the non-alphabet character [.]
Multiword units Yes for the teams who decided to add them at this stage, no for those who didn’t  
Hyphenated compounds Yes, as long as they can be found automatically  
Phrasal verbs No for English, as they count as multi-words—yes for languages where they have a one word lemma  
Phrases, idioms, proverbs, quotations No  
Subject-specific vocabulary Only if it makes it by the normal frequency criteria (it may do, for instance for some computing terms) NB When it comes to adding CEF levels, we may need to consider grammar vocabulary as a special case because of its usefulness to language learners
Dialect words No  
Items marked by register, e.g. very formal, slang, offensive Normal frequency rules apply: if they come in the top 5,000 then yes NB We agreed that an ‘offensive’ attribute should be added to the database so that while the frequency lists themselves can be purely frequency based, offensive items can be weeded out if necessary
Geographic terms Country name/related adjective/name of people/language For these: give your own, then any others that appear in your frequency list in the normal way
Oceans/continents/important areas/mountain ranges These should be included on a frequency basis, but privilege items which are not from your own area. So for the English list, an item such as ‘Mediterranean’ would be more important than ‘Lake District’. This suggestion is to avoid over-representation of these items—every list is likely to include many from one’s own region
Cities Your own capital city, plus any really major cities in your country which have a different name in translation. Then any cities from other countries which fulfil the normal frequency criteria and have a different name in your language from the original
We will not cover individual rivers, mountains, deserts etc
 
Famous places and buildings Only if they are used metonymically, e.g. Hollywood. Likely to be very rare  
Stars, planets, galaxies, etc No  
Imaginary, biblical or mythological people or place names No  
Personal names No  
Famous people and places, and other encyclopaedic information such as names of wars, treaties, names of ancient peoples, names of organizations, etc No  
Adjectives derived from famous people Only if they are in the top 5,000  
Festivals and ceremonies If they are in the top 5,000  
Trademarks If they appear in the top 5,000 and are the name of an item, but not company names  
Beliefs and religions, and associated nouns and adjectives If they are in the top 5,000  
Currencies Include your own currency and any others in the top 5,000  
* Otherwise referred to as synonyms
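
As an illustration only, three of the policies above (amalgamating spelling variants, keeping lexical variants separate, and flagging offensive items so they can be weeded out after ranking) might be sketched as follows. The mapping `PREFERRED_FORM`, the set `OFFENSIVE`, the function names and all frequency counts are hypothetical. They are not taken from the KELLY database or tools.

```python
from collections import Counter

# Hypothetical style guide: maps each spelling variant to the preferred form.
PREFERRED_FORM = {"organise": "organize", "colour": "color"}

# Hypothetical set of items carrying the 'offensive' attribute.
OFFENSIVE = {"damn"}


def amalgamate(counts):
    """Merge spelling variants into their preferred form, summing frequencies."""
    merged = Counter()
    for word, freq in counts.items():
        merged[PREFERRED_FORM.get(word, word)] += freq
    return merged


def top_items(counts, limit=5000, drop_offensive=False):
    """Rank purely by frequency, cut at `limit`, then optionally drop offensive items."""
    ranked = [word for word, _ in amalgamate(counts).most_common(limit)]
    if drop_offensive:
        # Filtering happens after the cut, so the ranking itself stays frequency based.
        ranked = [word for word in ranked if word not in OFFENSIVE]
    return ranked


# Made-up counts for illustration only.
raw = Counter({
    "organize": 40, "organise": 35,   # spelling variants: amalgamated
    "color": 20, "colour": 25,        # British/US variants: amalgamated
    "ATM": 10, "cash machine": 8,     # lexical variants: kept separate
    "damn": 30,                       # flagged as offensive
})
merged = amalgamate(raw)
```

The offensive filter is applied after the frequency cut-off, so the underlying list remains purely frequency based, as the comment on register-marked items requires.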