Corpus-based vocabulary lists for language learners for nine languages

Kilgarriff, Adam; Charalabopoulou, Frieda; Gavrilidou, Maria; Johannessen, Janne Bondi; Khalil, Saussan; Johansson Kokkinakis, Sofie; Lew, Robert; Sharoff, Serge; Vadlapudi, Ravikiran; Volodina, Elena

doi:10.1007/s10579-013-9251-2

Original Paper
Open access
Published: 14 September 2013

Volume 48, pages 121–163, (2014)
Language Resources and Evaluation

Adam Kilgarriff¹,
Frieda Charalabopoulou²,
Maria Gavrilidou²,
Janne Bondi Johannessen³,
Saussan Khalil⁴,
Sofie Johansson Kokkinakis⁵,
Robert Lew⁶,
Serge Sharoff⁴,
Ravikiran Vadlapudi¹ &
Elena Volodina⁵

Abstract

We present the KELLY project and its work on developing monolingual and bilingual word lists for language learning, using corpus methods, for nine languages and thirty-six language pairs. We describe the method and discuss the many challenges encountered. We have loaded the data into an online database to make it accessible for anyone to explore and we present our own first explorations of it. The focus of the paper is thus twofold, covering pedagogical and methodological aspects of the lists’ construction, and linguistic aspects of the by-product of the project, the KELLY database.

1 Introduction

Word lists are much-used resources in many disciplines, from language learning to psycholinguistics. A natural way to develop a word list is from a corpus. Yet a corpus-derived list on its own usually has grave shortcomings as a practical resource. In this paper we explore a substantial effort to generate word lists for nine languages, as far as possible in a corpus-driven, principled way, but with the overriding priority of creating lists which are as useful as possible for language learners.

The goal of the KELLY project^{Footnote 1} was to develop sets of bilingual language learning word cards in many different language combinations. For this we needed to know which words to include, and we wanted them to be the 9,000 most frequent words in nine languages. We then added a research goal: to use as principled a corpus-driven method as possible. The lists needed to be ordered, so learners could learn the more common words first. Four of the languages were ‘more commonly taught’ (Arabic, Chinese, English, Russian), the other five ‘less commonly taught’ (Italian, Swedish, Norwegian, Greek, Polish). The selection of the languages was dictated by three factors: the company that initiated the idea (Keewords AB, Sweden) and their interests; the EU Lifelong Learning Programme’s agenda of improving resources for smaller languages and less obvious language pairs; and participants’ research networks.

The KELLY procedure for preparing the list for each language was as follows:

Identify the corpus
Generate a frequency list (the ‘Monolingual 1’ or ‘M1’ list)
Clean up the list, and compare it with lists from other corpora and other wordlists
Make adjustments to give the ‘M2’ list
Translate each item into all the other KELLY languages (the ‘Translation 1’ or ‘T1’ list)
Use the ‘back translations’ to identify items for addition or deletion
Make further adjustments to give the final, M3 list.
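
The back-translation step can be illustrated with a minimal sketch. It assumes, for illustration only, that a word becomes a candidate for addition when several other languages offer it as a translation although it is absent from the M2 list. The function name, the `min_sources` threshold and the toy data are hypothetical and are not taken from the project.

```python
from collections import defaultdict

def back_translation_candidates(m2_list, incoming, min_sources=2):
    """Words offered as translations into this language by at least
    `min_sources` other languages, yet absent from its M2 list.

    m2_list  -- iterable of words on the current (M2) list
    incoming -- iterable of (source_language, translated_word) pairs
    """
    on_list = set(m2_list)
    sources = defaultdict(set)
    for source_language, word in incoming:
        sources[word].add(source_language)
    return sorted(
        word for word, langs in sources.items()
        if word not in on_list and len(langs) >= min_sources
    )

# Invented toy data: an English M2 list and translations arriving from other lists.
m2_english = ["house", "eat", "red"]
incoming = [
    ("sv", "house"), ("sv", "shirt"), ("it", "shirt"),
    ("pl", "shirt"), ("it", "bread"),
]
print(back_translation_candidates(m2_english, incoming))  # ['shirt']
```

In this sketch the output is only a list of candidates; the final decision would still rest with expert judgement, in keeping with the procedure described here.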

While the process was corpus-based, it was not one in which the corpus was religiously seen as the authority. Every corpus has peccadilloes, and the corpus to which you have access is rarely the ideal corpus for the task at hand. So, at various points, we were happy for expert judgement to overrule corpus frequencies. The paper considers these divergences and what underlies them.

Once the process was complete, the translations were entered into a database which let us ask questions like “What ‘symmetrical pairs’ are there, where X is translated as Y, and Y is also translated as X?” and “What word sets of three or more words (all of different languages) are there where all words are in symmetric pairs with all others?”. The database is available to all to interrogate.^{Footnote 2}
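
The two example questions can be made concrete with a small sketch. It assumes that each translation is stored as a directed link between (language, word) items. The data are invented and the brute-force search is chosen for readability, so this is not a description of how the KELLY database is implemented.

```python
from itertools import combinations

# A word is identified by (language, word); a translation is a directed link.
translations = {
    (("en", "sun"), ("sv", "sol")),
    (("sv", "sol"), ("en", "sun")),
    (("en", "sun"), ("it", "sole")),
    (("it", "sole"), ("en", "sun")),
    (("sv", "sol"), ("it", "sole")),
    (("it", "sole"), ("sv", "sol")),
    (("en", "big"), ("sv", "stor")),  # one direction only, so not symmetric
}

def symmetric_pairs(edges):
    """Unordered pairs {X, Y} where X -> Y and Y -> X are both recorded."""
    return {frozenset((x, y)) for x, y in edges if (y, x) in edges}

def cliques(edges, size=3):
    """Sets of `size` words, all of different languages, in which every
    word is in a symmetric pair with every other word."""
    pairs = symmetric_pairs(edges)
    words = sorted({word for pair in pairs for word in pair})
    found = []
    for group in combinations(words, size):
        languages = {language for language, _ in group}
        if len(languages) == size and all(
            frozenset(two) in pairs for two in combinations(group, 2)
        ):
            found.append(group)
    return found

print(len(symmetric_pairs(translations)))  # 3
print(cliques(translations))
```

Checking every combination is exponential in the number of words, so a list of 9,000 items per language would need a dedicated clique-finding method; the sketch only fixes the definitions.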

The structure of the paper is as follows: Sect. 2 discusses word lists and presents an overview of the relevant literature, Sect. 3 gives details of the KELLY procedure for preparing lists, Sect. 4 considers the KELLY database as a resource for linguistic research, and Sect. 5 concludes.

2 Word lists

Word frequency lists can be seen from several perspectives. For computational linguistics or information theory, they are also called unigram lists and can be seen as a compact representation of a corpus, lacking much of the information (being decontextualised), but small and easily tractable. Unigram lists (and also n-gram lists where n = 2, 3, 4) are basic for all language modeling, from speech recognition to machine translation. Systems that use word lists in areas relating to language learning include automatic rating of good corpus examples where the vocabulary is checked for being common (frequent) versus rare (infrequent) (Kilgarriff et al. 2008; Kosem et al. 2011; Borin et al. 2012), and readability analysis where texts are analyzed for their lexical frequency profiles (Heimann Mühlenbock 2012; Volodina 2010).
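
A lexical frequency profile of the kind mentioned for readability analysis can be sketched as follows. The band limits, the function name and the toy word list are assumptions made for illustration; the cited systems may compute their profiles differently.

```python
def frequency_profile(tokens, ranked_words, bands=(1000, 2000)):
    """Share of tokens falling into each frequency band of a ranked list.

    ranked_words -- words ordered from most to least frequent
    bands        -- upper rank limit of each band; the rest is 'off-list'
    """
    rank = {word: i + 1 for i, word in enumerate(ranked_words)}
    labels, lower = [], 1
    for limit in bands:
        labels.append(f"{lower}-{limit}")
        lower = limit + 1
    counts = dict.fromkeys(labels + ["off-list"], 0)
    for token in tokens:
        r = rank.get(token.lower())
        for limit, label in zip(bands, labels):
            if r is not None and r <= limit:
                counts[label] += 1
                break
        else:  # no band matched: unknown word or rank beyond the last band
            counts["off-list"] += 1
    total = len(tokens) or 1
    return {label: n / total for label, n in counts.items()}

# Toy ranked list and text, with two tiny bands for demonstration.
ranked = ["the", "cat", "sat", "on", "mat"]
text = "The cat sat on the mat quietly".split()
profile = frequency_profile(text, ranked, bands=(2, 4))
print(profile)
```

A text whose tokens fall mostly in the first band would be judged lexically easier than one with a large off-list share.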

Psychologists exploring language production, understanding, and acquisition are also interested in word frequency, as a word’s frequency is related to the speed with which it is understood or learned. So frequency needs to be used as a criterion in choosing words to use in psycholinguistic experiments. A number of frequency-based word lists constitute a part of the Psycholinguistic Database^{Footnote 3} with the named resources being used in different experiments, for example Davis (2005) and Aitchison (2012).

Educationalists are interested in frequency too, as it can guide the curriculum for learning to read and for similar purposes. To these ends, for English, Thorndike and Lorge prepared The Teacher’s Word Book of 30,000 Words in 1944 by counting words in a corpus, creating a reference set used for many studies for many years (Thorndike and Lorge 1944). It made its way into English language teaching via West’s General Service List (West 1953), which was a key resource for choosing which words to use in the English language teaching curriculum until the British National Corpus replaced it in the 1990s. More recently, the English Profile project^{Footnote 4} has developed the ‘English Vocabulary Profile’ which lists vocabulary for each CEFR level^{Footnote 5} (Capel 2010).

In language teaching, word frequency lists are used among other things for:

defining a syllabus
building graded readers
deciding which words are used in:
- learning-to-read books for children
- textbooks for second language (L2) learners
- dictionaries
- language tests for L2 learners

2.1 The pedagogical perspective: learning vocabulary using lists and cards

Vocabulary learning is an essential part of mastering a second language (L2). According to Nation (2001), vocabulary knowledge constitutes an integral part of learners’ general L2 proficiency and is a prerequisite for successful communication.

In terms of language pedagogy, there are two generally accepted approaches to vocabulary learning: intentional, where activities are aimed directly at learning lexical items, such as using word lists and cards; and incidental, where learning vocabulary is a by-product of activities not primarily focused on the systematic learning of words, such as reading (Nation 2001).

Although sometimes seen as opposed to each other (Nation 2001:232), both intentional and incidental vocabulary learning should have a place in language learning and should be seen as complementary to each other (Hulstijn 2001).

From the communicative perspective, incidental or ‘contextual’ vocabulary learning contributes to successful lexical development, while intentional learning, especially if it involves rote learning such as using word lists and cards, may result in misuse of the vocabulary since words are learned in isolation. Intentional learning may even fail to transfer information contained in chunks of language (e.g. collocations, expressions etc.), seen as essential for communicative fluency (McCarten 2007). Intentional learning methods have therefore largely fallen out of fashion or been dismissed by advocates of the communicative approach.

A substantial body of research, however, lends support to the claim that intentional or ‘decontextualised’ vocabulary learning using word lists and cards should not be marginalised. In her discussion of L2 vocabulary acquisition, Laufer (2003), for example, has shown that this type of learning may in certain cases prove to be more efficient than incidental/contextualised vocabulary learning, since incidental learning requires exposure to rich L2 input environments as well as extensive reading and listening, which delays the whole learning process. She estimates that learners may need to read a text of 200,000 words in order to learn 108 words from context, which seems unrealistic given classroom limitations. If a learner has limited exposure to the L2 outside the classroom, then intentional, word-focused activities should complement contextual vocabulary learning (Hulstijn 2001; Laufer 2003; Nation 2001). List learning can be of particular benefit for lower-level L2 learners and prove to be an efficient way to achieve vocabulary mastery.
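
The scale of the estimate quoted above is easier to see as a ratio. The short calculation below uses only the two figures from the text; the `reading_needed` helper is hypothetical and assumes that the ratio scales linearly, which is a simplification for illustration and not a claim made by Laufer.

```python
# Figures quoted in the text above.
WORDS_READ = 200_000
WORDS_LEARNED = 108

def reading_needed(target_words):
    """Running words of text to read in order to learn `target_words` words
    from context, ASSUMING the quoted ratio scales linearly."""
    return target_words * WORDS_READ / WORDS_LEARNED

print(round(WORDS_READ / WORDS_LEARNED))  # 1852 running words per word learned
```

On these figures a learner reads roughly 1,850 running words for every word acquired from context.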

A key issue for vocabulary learning is retention, and a key aim of vocabulary learning activities and materials should be long-term retention. A number of studies have indicated the usefulness of lists in word learning (Schmitt and Schmitt 1995; Waring 2004; Mondria and Mondria-de Vries 1994), and Hulstijn (2001) and Nation (2001) found that the use of word lists is associated with good retention and faster gains. In fact, “there are a very large number of studies showing the effectiveness of such learning (i.e. using vocabulary cards) in terms of the amount and speed of learning” (Nation 1997).

Using lists and cards also facilitates self-directed learning and learner autonomy, as learners may work at their own pace. It does, however, require motivated and disciplined learners, who should also be able to deploy the right metacognitive strategies for self-monitoring, planning their own learning, etc., since “If they [learners] cannot monitor their learning accurately and plan their review schedule accordingly, they cannot make the most of word cards and may run the risk of inefficient learning, e.g. over-learning (devoting more time than necessary) of easy items or under-learning of hard items” (Nakata 2008:7).

2.2 What word lists are there?

If using word lists and cards can be a useful tool for dedicated L2 vocabulary learning, the next question is whether such lists are already available. And if so, how good are they? Might the KELLY lists improve on what is currently available? In this section we review the lists in existence for the languages of the project, except English, which has been mentioned above.

Arabic

At the time of the start of the KELLY project, no Arabic word lists or corpora could be found and so a new, internet-based corpus was produced for the purpose of the project. However, during the course of the project, A Frequency Dictionary of Arabic: Core Vocabulary for Learners was published (Buckwalter and Parkinson 2011). An excellent resource for learners, it contains the 5,000 most frequently used words in Arabic. It is just over half the size of the final 9,000 word KELLY list for Arabic, but also contains dialectal Arabic words, which were largely removed from the KELLY list in line with most programmes teaching Arabic as a foreign language, which teach Modern Standard Arabic (MSA). In terms of structure, the frequency dictionary is strictly ordered by word frequency, containing smaller thematic lists and an alphabetical index. In the KELLY list, the word frequency order has largely been kept, but in line with the wider KELLY project aim, relevance to L2 learners overrode frequency and irrelevant items were omitted or moved within the list. For example, numbers were included as a category, irrespective of individual numbers’ frequency in the corpus. Vocabulary items seen as essential to language learning with few or no occurrences were added through comparison with other language lists—for example names of foods and items of clothing that appeared on several of the other language lists, but not in the Arabic list. Conversely, vocabulary items that did not fit into the CEFR levels and would seem out of place in a language learning environment were omitted, such as heavily religious vocabulary items.

Chinese

Interest in producing Chinese frequency lists is amplified by the unique need to arrange a very large inventory of characters in a way that is useful for language learners. One of the first corpus-based frequency lists for Chinese was produced in the 1920s from a corpus of more than 500,000 words (Xiao et al. 2009). This line of research continued through the 20th century, culminating in A Frequency Dictionary of Mandarin Chinese (Xiao et al. 2009). Like the Arabic dictionary from the same series mentioned above, it is a very useful resource for language learners, although it is based strictly on frequency and does not group words into thematic categories.

Greek

There are some word lists available for Greek, mainly created and used for language learning purposes (Charalabopoulou and Gavrilidou 2011). The first, provided by the Center for the Greek Language, which has exclusive responsibility assigned by the Greek Government for the organisation, planning, and administration of examinations for the Certification of Attainment in Modern Greek, includes two word lists, simply described as “Indicative Vocabulary for Levels A & B” (Efstathiadis et al. 2001). The lists are not corpus-based and the number of lemmas is not specified.

The second wordlist is found in an appendix to the curriculum for teaching Modern Greek as an L2 to adults published by the University of Athens, and is based solely on the authors’ intuition and teaching experience. The authors believe the words are “representative vocabulary”, and comply with the communicative needs and learning goals specified in the curriculum in relation to particular notions and functions, speech acts and thematic domains. The number of words is not specified (University of Athens 1998).

Thirdly, a dictionary of Greek as a foreign language^{Footnote 6} has recently been produced by the Education of the Muslim Minority Children in Thrace project, within the Programme for the Education of Muslim Children 1997–2008.^{Footnote 7} The dictionary includes 10,000 lemmas arrived at through combining existing monolingual dictionaries for Greek schoolchildren, representing basic/core vocabulary items, and e-corpora, including school textbooks.

Lastly, three different but complementary corpora were created as part of the research project ‘Corpora in Modern Greek Language Research and Teaching’, co-funded by the European Social Fund and National Fund (EPEAEK I) (Mikros 2007): a general corpus of Modern Greek, a special corpus for teaching Modern Greek as a foreign language, and a corpus of material produced by learners. Various word lists were produced from the corpora in order to study high and low frequency vocabulary usage in various Natural Language Processing applications.

Italian

The Lessico di frequenza dell’italiano parlato (LIP) [Frequency Lexicon of Spoken Italian] is one of the most important collections of texts of spoken Italian and one of the most widely used in linguistic research. It was composed by a group of linguists led by Tullio De Mauro who used it to build the first frequency list of spoken Italian (De Mauro et al. 1993). Its 469 texts, containing a total of approximately 490,000 words, were collected in four cities (Milan, Florence, Rome and Naples), and comprise face-to-face and mediated dialogues and monologues.

The Vocabolario di Base della lingua italiana (VdB) [Basic Vocabulary of Italian], also by De Mauro, is a 7,000-word list drawn up with mainly statistical criteria and appears in the Guida all’uso delle parole [Guide to the Use of Words] (De Mauro 1997). It represents the part of the Italian language used and understood by most Italians. It includes the first 4,700 words in the LIP (Bortolini et al. 1972) with a further 2,300 frequently used words mainly sourced from widely-used Italian dictionaries. The words in the VdB are grouped into three levels: fundamental vocabulary (from the LIP), high-use vocabulary (also from the LIP) and high-availability vocabulary (those words sourced from dictionaries).

The VdB was the first work of this kind in Italy and is now widely used, for example to monitor and improve the readability of a text according to scientific criteria.

Two centres for teaching Italian as a foreign language, the Università per Stranieri di Perugia and the Università per Stranieri di Siena, were contacted and replied that there are no official word lists for assessing students’ knowledge of Italian or for preparing teaching material. However, the most used frequency lists for deriving lexical syllabi are the LIP and VdB. Both centres have developed lists of words most used by learners based on speech produced by L2 students of Italian at different levels.

Norwegian

Although no official word list could be found, several word lists exist for Norwegian in textbooks for learning Norwegian as a foreign language. However, it is unclear how these word lists were formed.

There is also Lexin,^{Footnote 8} the online series of bilingual dictionaries (Norwegian-minority languages) with 36,000 entries, based on the Swedish version (see below). It includes a series of illustrations divided into 33 topic areas such as family and relatives, our bodies outside, the human body inside, mail and banking, and school and education.

Polish

No official or otherwise widely-used word list was found.

Russian

Early modern frequency lists from the 1950s and 1960s are available for Russian (Josselson 1953; Shteinfeld 1963), as well as a later dictionary (Zasorina 1977) produced from a one-million-word corpus. However, Russia’s turbulent history in the past 50 years has resulted in substantial changes in the Russian lexicon, which are not reflected in these early lists.

Corpora since then have expanded significantly with the increase in the number of texts available in electronic form.

Further development of the KELLY list for Russian led to a frequency dictionary in the same series as those referred to above for Arabic and Chinese (Sharoff et al. 2013), with corpus examples and their translation into English, topical word lists, and information on the frequency of multiword units.

Swedish

For Swedish there are a number of word lists available. The oldest and most famous is Sture Allén’s Tiotusen i topp [Top ten thousand; Allen 1972]. It was produced using newspaper texts collected around 1965, and has not been updated.

Other leading resources include:

Svensk skolordlista [Swedish wordlist for schools], with 35,000 words, is the outcome of a collaboration between the Swedish Academy and the Swedish language board. It is aimed at pupils in the 5th grade and higher, and contains short explanations in simplified Swedish for most words. It is a selection from the SAOL (Swedish Academy’s Wordlist of Swedish Language), which has approximately 125,000 words, and is updated regularly. It reflects the most frequent vocabulary in modern newspapers and books, and includes a number of colloquial words. However, no frequency information is provided.

Lexin Svenska ord med uttal och förklaringar ^{Footnote 9} [Lexin Swedish words with pronunciation and explanations] contains 28,500 words and is aimed at immigrants. The vocabulary has been selected using frequency studies, vocabulary from course books, words specific to social studies (partly manually selected and partly from specific interpreter lists), and colloquial and/or ‘difficult’ vocabulary items taken from a range of sources (Gellerstam 1978). It is regularly updated from corpus studies, though there are no frequencies or information on the vocabulary appropriateness for different learner levels.

The Base Vocabulary Pool ^{Footnote 10} (Forsbom 2006) is a frequency-based list constituting central vocabulary derived from the SUC (Stockholm Umeå Corpus). The base vocabulary pool is created on the assumption that domain- or genre-specific words should not be in the base vocabulary pool. The core of this list is constituted by stylistically neutral general-purpose words collected from as many domains and genres as possible. Out of 69,371 entries in the lemma list based on SUC, 8,215 lemmas are included in the base vocabulary pool.
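
The idea that a base vocabulary should exclude domain- or genre-specific words can be illustrated with a simple dispersion filter. The criterion below (a lemma must occur in at least `min_genres` genres) is an assumption for illustration and is not Forsbom’s actual selection method; the toy tokens are invented. The last line computes the share of SUC lemmas retained, using the two counts given in the text.

```python
from collections import defaultdict

def base_vocabulary(tagged_tokens, min_genres):
    """Lemmas attested in at least `min_genres` distinct genres.

    tagged_tokens -- iterable of (genre, lemma) pairs
    """
    genres_by_lemma = defaultdict(set)
    for genre, lemma in tagged_tokens:
        genres_by_lemma[lemma].add(genre)
    return {
        lemma for lemma, genres in genres_by_lemma.items()
        if len(genres) >= min_genres
    }

# Invented toy data: 'enzym' is frequent but confined to one genre.
sample = [
    ("news", "och"), ("fiction", "och"), ("science", "och"),
    ("science", "enzym"), ("science", "enzym"),
    ("news", "hus"), ("fiction", "hus"),
]
print(sorted(base_vocabulary(sample, min_genres=2)))  # ['hus', 'och']

# Share of SUC lemmas included in the base vocabulary pool (counts from the text).
print(f"{8_215 / 69_371:.1%}")  # 11.8%
```

Counting genres rather than raw frequency is what keeps a frequent but specialised item out of the pool in this sketch.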

3 Preparing the KELLY lists

The KELLY lists aim to reflect the contemporary language, to constitute the most frequent core vocabulary, and to be based on objective selection except where pedagogical needs dictate otherwise.

The corpora they are based on should be large enough, and comprise enough different documents from a range of domains, to minimise the risk of words of specialised vocabulary appearing in the lists. We used the same methodology to create the corpora for each of the nine languages, so that the respective word lists could be, as far as possible, comparable.

Work on the lists was divided into five distinct phases, as outlined in Fig. 1.

We will now walk the reader through these phases, step by step.

3.1 Identify/create the corpus

For each language, we needed a corpus. We wanted it to be a corpus of general, everyday language and we wanted it to be large, with enough different texts so that it would not be skewed by particular texts or topics, and so that it would not miss any core vocabulary. Moreover, we wanted the corpora of the different languages to be, as far as possible, ‘comparable’: we wanted all the lists to represent the same kind of language, so we could make connections between them.

For some languages there was a good choice of corpora available, but not for others. Spoken corpora were only available for a minority of the languages.

One corpus type that is available or can be created for most languages, and which does provide a large general corpus, is a web corpus, using methods as presented in Sharoff (2006) and Baroni et al. (2009). These papers also show that web corpora can represent the language well—in some regards, better than a corpus such as the BNC, which has a heavier weighting of fiction, newspaper, and in general the more formal and less interactive registers. For each of the languages, we had access to or created a web corpus using the methods described by Sharoff and Baroni et al.

A central question was: what should the list be a list of? The most basic option was word forms, so invade, invading, invades and invaded would all be separate items. This was at odds with usual practice, and not useful for learners (especially for highly inflectional languages like Russian, Polish, Greek and Arabic), so we needed to lemmatise the corpus: to identify, for each word, the lemma. We also decided that the list items would all be associated with a word class (noun, verb etc.), with brush (noun) and can (noun) treated as distinct items from brush (verb), can (verb) and can (modal). For this we needed a part-of-speech tagger.
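
The effect of counting lemma and word class together, rather than word forms, can be shown with a short sketch. It assumes the corpus has already been lemmatised and tagged; the triples below are invented stand-ins for tagger output, and the function name is hypothetical.

```python
from collections import Counter

def lemma_frequency_list(analysed_tokens):
    """Rank (lemma, word class) items by frequency, most frequent first.

    analysed_tokens -- iterable of (word_form, lemma, word_class) triples,
                       as a lemmatiser and tagger might produce them
    """
    counts = Counter(
        (lemma, word_class) for _, lemma, word_class in analysed_tokens
    )
    return counts.most_common()

tokens = [
    ("invaded", "invade", "verb"),
    ("invading", "invade", "verb"),
    ("invades", "invade", "verb"),
    ("brush", "brush", "noun"),
    ("brushed", "brush", "verb"),
    ("can", "can", "modal"),
    ("can", "can", "noun"),
    ("can", "can", "modal"),
]
for item, frequency in lemma_frequency_list(tokens):
    print(item, frequency)
```

The three forms of invade collapse into one item, while brush and can each split into separate items by word class.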

Table 1 shows that the corpora are comparable in terms of the source of texts (web-acquired), and all very large. Some random sample analysis of corpus texts and the most frequent nouns/verbs/adjectives, as well as an overview of hapax legomena in the Swedish corpus, SwedishWaC, indicated that its text constitution is very much like that of the English corpus, UKWaC, and that the majority of texts are made up of newspaper texts, Wikipedia articles, forums, chats and blogs (Volodina and Johansson Kokkinakis 2012). It also allows us to hypothesise about the dominating text genres in other web-acquired corpora collected in the same way.

Table 1 Main corpora and processing tools for each language
