The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.
The Sketch Engine is a leading corpus tool. It has been widely used in lexicography. It is now 10 years since its launch (Kilgarriff et al. 2004).
Those 10 years have seen dramatic changes. They have seen the near-death of dictionaries on paper, at the hands of electronic dictionaries.Footnote 1 They have seen the emergence of entire new ecosystems of dictionaries on the web, with many new players (Google, weblio.jp, dictionary.com, Leo, Wordnik.com). Previously, the dominant players had been around for decades, even centuries: Longman (who published Johnson’s dictionary in 1755), Kenkyusha, OUP, Le Robert, Duden, Merriam-Webster.
In the world at large, we have seen the invention and world takeover of the smartphone. 1994–2004 saw the switch of most dictionary lookups from paper to electronic: 2004–2014 has seen them nearly all (in percentage terms) switch from computer to phone. (Just think how often your students look up words on their phones, versus how often they look them up in any other way.) Dictionaries are far, far more available and accessible than they were. The sheer number of dictionary lookups has risen many times over (even as—bitter irony—many dictionary companies have seen their income collapse).
This is all at the publishing end of the dictionary business. What about the lexicography end? Here, we have seen the corpus revolution (Hanks 2012). It started in Northern Europe in the 1980s and 1990s, and has been spreading. For Chinese, a first thoroughly corpus-based dictionary was probably Huang et al. (1997)’s classifier-noun collocation dictionary. For Arabic, it is Oxford University Press’s Oxford Arabic Dictionary (Arts 2014),Footnote 2 though this was not produced in Asia. In Japan, corpus lexicography started in bilingual dictionary projects such as the WISDOM English-Japanese Dictionary (Sanseido 2003, 2007), but a truly corpus-based monolingual dictionary of Japanese is yet to appear.
Thus the 10 years of the Sketch Engine have also been the 10 years of bringing corpora into Asian lexicography. The paper is a perspective on those changes.
In this paper we review
the languages covered
the corpora accessible in the Sketch Engine, and
developments in the software over the past decade.
We finish by reviewing related work: other corpora, corpus websites and corpus tools as available for lexicography and corpus linguistics.
‘Sketch Engine’ refers to two different things: the software, and the web service. The web service includes, as well as the core software, a large number of corpora pre-loaded and ‘ready for use’, and tools for creating, installing and managing your own corpora. The paper covers both, with Sects. 2 and 6 focussing on the software, and Sects. 3, 4 and 5 on the web service.
The Sketch Engine software: core functions
The word sketch
This is a feast of information on the word. For catch (verb) just looking at the first column (objects of the verb) we immediately see a number of meanings, idioms and set phrases. We catch a glimpse of or catch sight of something. Fishermen, fishersFootnote 4 and anglers (column 2) catch fish, trout and bass. You often want to catch someone’s attention. You sometimes catch your breath and things sometimes catch your eye. Sportsmen and women, in a range of sports, catch passes and balls. Things catch fire. We all sometimes catch buses.
The ‘object’ column is noise-free, and all items on it are immediately interpretable by a native speaker. The second column, for subject, introduces a couple of complications. Surprise relates to the expression caught by surprise. Eye and breath are objects misanalysed as subjects. Touchdown catches is a term from American football: the word sketch succeeds in bringing it to our attention, though catches is a noun which has been misanalysed as a verb. Police introduces a new meaning of the verb (police catch criminals) and Anyone brings to our attention the related pattern Anyone caught [doing X] will be [punished].
The third column, and/or, tells us more about the police and sports meanings. Overheat goes with catch fire. Tangle and snag introduce a new meaning where, if a rope or line or piece of cotton or string or wire catches with something else, it no longer runs free.
The fourth table brings our attention to the phrasal verbs catch up, catch on, catch out; the fifth, to the reflexive use (I caught myself wondering…). The next set of tables shows us what we might be caught in (the crossfire, a trap, the headlights), on (videotape, CCTV), by and with (your pants down). The final column takes us back to the police, with people being caught red-handed and unprepared.
The word sketch can be seen as a draft dictionary entry. The system has worked its way through the corpus to find all the recurring patterns for the word and has organised them, ready for the lexicographer to edit, elucidate, and publish. This is how word sketches have been used since they were first produced.
When looking at a word sketch, a user often wants to find out more: where and how, for example, was catch used with the collocates with and pant? They can do this by clicking on the number, and seeing the concordance, as in Fig. 2.
This is usually enough to show why a collocate occurred in a word sketch.
The concordance is the basic tool for anyone working with a corpus. It shows you what is in your corpus. It takes you to the raw data, underlying any analysis. Getting there from a word sketch is just one of the ways of getting to a concordance. The basic method is from the simple search form, as in Fig. 3.
This is modelled on the Google search form. Users like a simple input form, where they put in what they are looking for, and the tool finds it for them. It is for the tool to do its best to understand what it was that the user wanted, and to find it for them. In the case of the Sketch Engine, simple searches are interpreted as follows:
case-insensitive (so a search for catch finds catch, Catch and CATCH)
as searches for either word form or lemma (so a search for catch finds catch, catching, catches, caught, and a search for caught finds just caught), and
where there is more than one item (with space as separator), a sequence.Footnote 5
These three aspects combine, so a simple search for catch fire finds all the hits in Fig. 4.
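The three rules above can be sketched in a few lines of Python. This is only an illustration of the matching logic as described here, not the Sketch Engine implementation: the token representation (a pair of word form and lemma), the function names and the toy corpus are all assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, lemma); a hypothetical representation


def item_matches(token: Token, query_item: str) -> bool:
    """Case-insensitive match against either the word form or the lemma."""
    word, lemma = token
    q = query_item.lower()
    return q == word.lower() or q == lemma.lower()


def simple_search(tokens: List[Token], query: str) -> List[int]:
    """Return the start positions where the space-separated items match as a sequence."""
    items = query.split()
    if not items:
        return []
    hits = []
    for start in range(len(tokens) - len(items) + 1):
        if all(item_matches(tokens[start + i], item) for i, item in enumerate(items)):
            hits.append(start)
    return hits


# Toy corpus, invented for illustration.
corpus = [
    ("The", "the"), ("curtains", "curtain"), ("caught", "catch"), ("fire", "fire"),
    ("and", "and"), ("CATCH", "catch"), ("Fire", "fire"), ("burned", "burn"),
]

print(simple_search(corpus, "catch fire"))   # matches via lemma and via case-folded form
print(simple_search(corpus, "caught fire"))  # matches the word form caught only
```

In this sketch a search for catch fire finds both caught fire and CATCH Fire, while a search for caught fire finds only the former, mirroring the word-form-or-lemma behaviour described above.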
Users often want more control than the simple search offers. By clicking on ‘Query types’ they see the options as in Fig. 5, and can specify a lemma (with optional word class, e.g. verb, noun, adjective) or a specific phrase or word form (with an option to match for case). ‘Character search’ is designed for languages which do not put spaces between words (Chinese, Japanese, Thai) so users can see a concordance for a character (without having to guess how the text has been segmented into words). CQL is the underlying corpus query language, which technically inclined users can input directly in the CQL box.Footnote 6 Other query types are automatically transformed into CQL queries which are then evaluated by the underlying database engine to obtain the results from the corpus.Footnote 7
Lexicographers often want to home in on a particular pattern of use to explore it further. This can be done with the Context options, as in Fig. 6.
To find all instances of catch that have pant or pants within five words, we search for catch, with pant entered as a lemma filter; the results are shown in Fig. 7.
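A window-based lemma filter of this kind might be sketched as follows. The function name, the token representation and the choice to apply the window to a single sentence at a time are assumptions for the purpose of illustration; how the Sketch Engine itself treats sentence boundaries is not specified here.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word form, lemma); a hypothetical representation


def context_filter(tokens: List[Token], node_lemma: str,
                   filter_lemmas: Set[str], window: int = 5) -> List[int]:
    """Positions of node_lemma that have one of filter_lemmas within `window` tokens."""
    hits = []
    for i, (_, lemma) in enumerate(tokens):
        if lemma != node_lemma:
            continue
        left = tokens[max(0, i - window):i]       # up to `window` tokens before the node
        right = tokens[i + 1:i + 1 + window]      # up to `window` tokens after the node
        if any(ctx_lemma in filter_lemmas for _, ctx_lemma in left + right):
            hits.append(i)
    return hits


# Two invented sentences.
caught_out = [("He", "he"), ("was", "be"), ("caught", "catch"), ("with", "with"),
              ("his", "his"), ("pants", "pant"), ("down", "down")]
caught_bus = [("She", "she"), ("caught", "catch"), ("the", "the"), ("bus", "bus")]

print(context_filter(caught_out, "catch", {"pant"}))  # the hit is kept
print(context_filter(caught_bus, "catch", {"pant"}))  # the hit is filtered out
```

Because the filter is stated in terms of lemmas, both pant and pants in the context satisfy it.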
Some corpora have the documents within them classified for text type, for example, the “Brown Family” corpus, comprising the original Brown corpus (American English, 1961) and its various clones (for British or American English and at various date points), all with the same structure and genre distribution. Clicking on Text Type in the concordance form (Fig. 5) shows the form in Fig. 8. The user can limit the search to a particular national variety, time, or genre, by ticking boxes.
Once the user has a concordance, there are many things that can be done with it. It can be sorted, sampled, filtered (for example by Context, or Text Type) or saved. A range of frequency analyses are available, including collocation reports and analysis by text types (where the corpus has text types defined). At the level of the individual hit, the user can click on the search term for more context (see Fig. 9), or on the item in the ‘reference’ column to see the metadata for the item.
The Sketch Engine prepares a ‘distributional thesaurus’ for a corpus. This is a thesaurus created on the basis of common collocation. If two words have many collocates in common, they will appear in each other’s thesaurus entry. It works as follows: if we find instances of both drink tea and drink coffee, that is one small piece of evidence that tea and coffee are similar. We can say that they ‘share’ the collocate drink (verb), in the OBJECT-OF relation. In a very large computation, for all pairs of words, we compute how many collocates they share, and the ones that share most (after normalisation) are the ones that appear in a word’s thesaurus entry. Distributional thesauruses are a topic of great interest in computational linguistics, and show promise for addressing a range of challenges.
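The idea of shared collocates can be made concrete with a small sketch. The input triples, the relation names and the function name are invented for illustration, and the normalisation used below (shared collocates divided by the size of the union, i.e. the Jaccard coefficient) is an assumption: the text above says only that the counts are normalised, not how.

```python
from collections import defaultdict
from itertools import combinations


def thesaurus(triples, top_n=3):
    """Build thesaurus entries from (word, relation, collocate) triples."""
    contexts = defaultdict(set)
    for word, relation, collocate in triples:
        contexts[word].add((relation, collocate))

    scores = defaultdict(dict)
    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        if not shared:
            continue
        # Assumed normalisation: Jaccard coefficient over (relation, collocate) pairs.
        score = len(shared) / len(contexts[a] | contexts[b])
        scores[a][b] = score
        scores[b][a] = score

    # Each entry lists the most similar words first (ties broken alphabetically).
    return {word: sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
            for word, neighbours in scores.items()}


# Invented data: tea and coffee share drink and brew (OBJECT-OF) and black (MODIFIER).
triples = [
    ("tea", "object_of", "drink"), ("coffee", "object_of", "drink"),
    ("tea", "object_of", "brew"), ("coffee", "object_of", "brew"),
    ("tea", "modifier", "green"), ("tea", "modifier", "black"),
    ("coffee", "modifier", "black"),
    ("beer", "object_of", "drink"), ("beer", "object_of", "brew"),
    ("beer", "modifier", "cold"),
    ("bus", "object_of", "catch"),
]

entries = thesaurus(triples)
print(entries["tea"])  # coffee ranks above beer; bus shares nothing and gets no entry
```

A real computation runs over all pairs of words in a very large corpus, so an efficient implementation would avoid the naive all-pairs loop used here.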
The thesaurus entry for tea (in both list and word-cloud form) is shown in Fig. 10.
Users and uses
The first Sketch Engine users were lexicographers, with Macmillan as the first user for the word sketches,Footnote 8 and Oxford University Press as the first for the Sketch Engine.
Lexicography, particularly for English and particularly in the 1980s and 90s, was the driving force in the development of corpus methods and corpus use. Lexicography required very large corpora, so there was evidence even for rare words and phrases. At the time—pre-web or in the web’s infancy, pre “big data”—few others in linguistics or the language professions saw any great need for corpora. The English learners’ dictionaries had a vast and growing market, and were highly profitable, and were competing intensively with each other to produce ‘the best’ dictionary. This was fertile ground for innovation.
Lexicography has continued to be a core use for the Sketch Engine, with four of the five main dictionary publishers in the UK (Cambridge University Press, Harper Collins, Macmillan, Oxford University Press) using it intensively. At CUP and Macmillan, this is just for English; at Collins also for the main European languages, and at OUP also for large bilingual-dictionary projects for Arabic, Chinese and Portuguese.
In the UK, dictionary publishing is dominated by companies (and the commercial wings of University Presses); this is possible largely because there is a very large market. In many countries and for many languages, the curation of the national language is seen as a national project, and most lexicography takes place in academies and national institutes. They form a second group of users for the Sketch Engine. The Sketch Engine is in use at national institutes for Bulgarian, Czech, Dutch,Footnote 9 Estonian, Irish,Footnote 10 and Slovak.
The Sketch Engine has come out of the academic research world, and, naturally, many of its users are in universities. Within universities, the main kinds of use are
in linguistics and languages departments: teaching and research
in computing departments: teaching and research in relation to language technology (also called Natural Language Processing, Computational Linguistics). This is the home area of all Sketch Engine team members
discourse analysis: analyses of a particular kind of language for what it tells us about the attitudes, power relations and perspectives of the participants. This kind of work takes place in a range of departments in the humanities and social sciences. Recent examples include the analysis of British newspaper discourse on migrants and migration; portrayal of science in the news; knowledge dissemination through personal blogs.
The Sketch Engine is widely used for English Language Teaching and occasionally also for the teaching of other languages including Chinese, Japanese and Arabic.Footnote 11 The ‘Teaching and Language Corpora’ community has been exploring ways of bringing corpus methods into language-teaching practice since Tim Johns’ work in the 1980s. Johns worked in Birmingham, UK, alongside the COBUILD project for using corpora for lexicography, and the uses of corpora for ELT can be seen as having two parts: indirect use, in the preparation of dictionaries (and coursebooks), as covered above, and direct: in the classroom.
A first ELT coursebook based on the Sketch Engine has recently been published (Thomas 2014).
Countries where the Sketch Engine is widely used in ELT include China, the Czech Republic, Germany, Italy, Japan, Spain and Taiwan as well as the UK.
Translators find corpora (of specific domains) useful for identifying the terminology and phraseology of the domain, in the language they are translating into. (They will usually be a native speaker of that language, but will often not know the terms and turns of phrase for specialised areas in which they have a translation task). A number of professional translators are Sketch Engine users.
In the context of large organisations needing to prepare many documents in multiple languages, consistency is a challenge: in particular, the consistent use of the same term (within each language) for the same concept. It is good practice to develop and maintain a terminology, in which there is an entry for each of the concepts in a domain, with a specification of the term to be used in each language. One of the challenges for terminologists is finding the concepts and terms. The Sketch Engine can be used for term-finding (Kilgarriff 2013). This functionality has been developed in collaboration with the World Intellectual Property Organisation.
Language technology companies
A word list (with frequencies) for a language is a central resource for almost any language technology application, from speech recognition to spelling correction to text prediction. The corpora in the Sketch Engine provide the raw material, and the software can produce the word lists (and also many other lists: of n-grams, keywords, lemmas, terms) for many languages. Several technology companies have been users of this kind.
The Sketch Engine aims to cover all the large languages of the world, as well as any languages which particular users are asking for.
By a ‘large language’ we mean a language with a large number of speakers. The ethnologue website provides a list of languages sorted by numbers of speakers, as shown in Table 1.
The Sketch Engine has high-level resources for fifteen of these languages (as well as for many smaller ones) and basic resources for a further four. The languages not covered are Javanese (where there is a complex relationship to Bahasa Indonesia, a variety of Malay, for which there is a basic resource) and four of the languages of India and Pakistan (Lahnda/Punjabi, Marathi, Oriya and Urdu).Footnote 12
The prerequisite for a basic resource for a language is simply a corpus (plus a segmentation tool where there are no spaces between words). A corpus can be collected from the web, using the Corpus Factory (Kilgarriff et al. 2010) or TenTen (Jakubíček et al. 2013) method.
For a high-level resource, further prerequisites are
a tokeniser (for Chinese and Japanese, usually called a segmenter) to identify the words. In simple cases this might just use spaces between words but many languages have clitics and similar needing language-specific treatment. English is a very simple language in this regard but even there, the hyphen and apostrophe characters, and mixtures of letters and non-letters, present challenges.
a lemmatiserFootnote 13
a part-of-speech tagger
a parser or ‘sketch grammar’.
What is also required is a collaborator. This is a person who speaks the language, and is ideally a computational linguist, and who cares about the quality of the output. They might care because they want to use the corpus in their own (or their group’s) work, or because they developed some of the tools and this is an opportunity to thoroughly test them and to show them (via data processed by them) to the world. The collaboration is crucial: without input from people who speak the language, the Sketch Engine team does not know if what it has done for a language is good. A collaborator is needed to point out mistakes and problems, which can then be addressed.
In the following sections we provide details about the status of Sketch Engine integration of various Asian languages.
The collaboration for Chinese began with Prof Huang Chu-Ren inviting the first author to Taiwan in 2004. (At the time, Huang was Deputy Director of the Linguistics Department at Academia Sinica, Taiwan.) Following that visit, and commercial interest in Chinese in the Sketch Engine from CJKI,Footnote 14 the Chinese Gigaword corpus was acquired from the Linguistic Data Consortium, segmented and part-of-speech-tagged at Academia Sinica using the tools developed there, and installed into the Sketch Engine. A sketch grammar was developed and word sketches were made available (Huang et al. 2005; Kilgarriff et al. 2005), as illustrated in Fig. 11. They have supported an extensive research programme since (e.g. Chung and Huang 2010; Huang et al. 2014 forthcoming).Footnote 15
The collaboration for Arabic is more recent, with the Centre for Computational Linguistics at Columbia University, USA (who prepared MADA + TOKAN, the leading tools for tokenisation, lemmatisation and POS-tagging of Arabic) and Arabic experts elsewhere in the USA, in Saudi Arabia and in the UK. Over a number of years we had received many expressions of interest regarding Arabic in the Sketch Engine. But the language presents a number of challenges:
there is modern standard Arabic (MSA: the language of the press, education, and officialdom, throughout the Arabic world), Classical Arabic, and the dialects. Most Arabic speakers speak largely their own dialect and are only occasional users of MSA. It is far from obvious what should be included in a corpus;
Arabic has many clitics, making tokenisation challenging;
Arabic is usually written without vowels;
Arabic has a complex morphological system, with a large share of the vocabulary being the result of derivations according to semi-productive processes. A central issue in Arabic lexicography is whether entries should be based on stems (the traditional approach, giving a smaller number of longer entries) or lemmas (which are closer to dictionary headwords in an English or French dictionary).
Other Asian languages
For Turkish there were open-source tools available, including a parser, so a Turkish web corpus was processed with that, and the dependency relations which were the output of the parser were used directly to form word sketches (Ambati et al. 2012), see Fig. 13.
For Persian, a very large corpus which had been prepared and parsed at Carnegie Mellon University, USA, was loaded into the Sketch Engine.Footnote 16
There are also resources for Azerbaijani, Bengali, Hebrew, Indonesian, Kazakh, Korean, Kyrgyz, Malay, Malayalam, Tajik, Tamil, Telugu, Thai, Tibetan (Garrett et al. 2014 forthcoming), Turkmen, and Uzbek (Baisa and Suchomel 2012).
“Corpora for all” is the Sketch Engine company tagline: here we give a brief survey of the range of corpora in the Sketch Engine.
Corpora in the Sketch Engine are either owned and managed by the Sketch Engine Team (‘preloaded’ corpora), or are user corpora, owned and managed by the user.
The primary goal is to provide, for each language, a large, recent, general language corpus for the language, processed with high-quality tools for the language, with word sketches, and checked extensively by one or more collaborators. These corpora are for lexicography and general language research, for example into the syntax or morphology of the language. ‘Large’ means at least 50 million words, and for recent work with large languages, several billion. In most cases these are web corpora, as the web is the only place to get material in vast quantity and covering a wide range of text types and domains. In some cases, for example Estonian or Irish, where there is collaboration with an organisation which has gathered a large corpus using some other method, we have combined web-sourced and other material.
These corpora can be kept up-to-date by crawling again and adding new material.
There are large, general-language corpora for sixty languages.
One central language task is translation. For that, a key resource is the parallel corpus, comprising sets of texts which are translations of each other (or, are both translations of the same source). Parallel concordancing, as in Fig. 17, is where a user inputs a search term in one language and sees pairs of sentences: those with the matching term in the first language, and the corresponding sentence in the target language.
In the Sketch Engine there are data for 300 language pairs. These data are from two main sources: EUROPARL and OPUS. EUROPARL comprises speeches made at the European parliament, which have been translated into 21 official European Union languages (Koehn 2005).Footnote 17 The OPUS data, a collection of parallel corpora collected in the OPUS project, are made available on its website.Footnote 18 It comprises many different parts, two of the largest (for most language pairs) being documents from the United Nations, and Open Subtitles.Footnote 19 Figure 17 shows subtitle data.
In addition to all of the functionality shown so far, an extra option for parallel data is to search simultaneously in both languages: Fig. 17 shows the output when is searched on the Chinese side, smile on the English.
Second/foreign language learning and teaching
In the context of language learning, two central questions arise:
what are learners saying and writing?
what should they be saying and writing?
For the first, there are learner corpora.Footnote 20 Learner corpora are valuable for finding out what learners, at various levels, do, and for research into the process of language learning as well as the practicalities of curricula, course development, and testing. In the Sketch Engine there are learner corpora for Slovene, Czech and English.Footnote 21
For the second, the general answer is “the language”, and general language corpora meet that need. But there is also a more specific answer: one large population of language learners are learning English, and would like to study at an English-medium university. Thus their target is the English that is spoken in seminars and written in University-level essays, by accomplished English speakers. The British Academic Spoken English (BASE) and British Academic Written English (BAWE) corpora have been created as samples of these target varieties.Footnote 22
A central topic for linguists is language development and change. Corpora looking back over the history of a language, and supporting this kind of research, include LatinISE (of Latin from the third century B.C. to the twentieth century A.D., McGillivray and Kilgarriff 2013), GermanC (of German from the seventeenth and eighteenth centuries; Scheible et al. 2011) and English Dialogues Corpus (sixteenth–eighteenth centuries; Culpeper and Kytö 2010).
For the Arabic world and Islam, the language has a special role. It is the language of the Quran and of the culture that the region shares. The different countries each have their own dialect, and the lingua franca, MSA, is closer to classical Arabic than to the dialects. The King Saud University Corpus of Classical Arabic (KSUCCAFootnote 23) brings together many of the central texts of this language, culture and religion, including the Quran and the Hadith.
Learning to speak
Since 1984, the CHILDES and Talkbank projects, based at Carnegie Mellon University, have been gathering child–adult conversations.Footnote 24 They are largely between babies and young children and their carers (with many of the carers being linguists, who have taken on the recording and transcription of the data). All are available as transcripts, and many also as audio or video. The data can be explored on the Talkbank website as well as the Sketch Engine: the two websites are complementary, with Talkbank expecting the user to be a developmental or general linguist, and the Sketch Engine expecting them to have a corpus orientation. There is a CHILDES corpus in the Sketch Engine for 22 languages, varying in size from a few thousand words to, for English, 23 million.
Learning to read and write
Educators, children’s authors and publishers, and linguists and psychologists studying the process of learning to read, are interested in the language that schoolchildren read and write. So are producers of children’s dictionaries. The Education division of Oxford University Press has created the Oxford Children’s Corpus (Wild et al. 2013), comprising both material written for children (largely stories, many being titles published by OUP) and stories written by children. This second part resulted from a competition led by the top UK disc jockey Chris Evans, who, from his show on BBC Radio 2, invited children to write a 500-word story and send it in to him. In 2014, 115,000 British children did so. The BBC then made the data available to OUP for linguistic research.Footnote 25
The size of the corpus, as at April 2014, is 115 million words.
The Brown corpus was central to the development of corpus linguistics. It was one million words, comprising five hundred 2,000-word samples from 15 different genres, all of American English published in 1961. It has played a huge role as a point of reference ever since and has spawned ‘Brown family’ corpora for British and American English, at a number of time points (see Fig. 8).
Another key reference corpus for English is the British National Corpus (Burnard 1995).
Sociolinguists are interested in how language varies between social groups, across age groups, with movements of populations, and between communities. A corpus designed to study these topics is the London English corpus (Kerswill et al. 2013).
As well as preloaded corpora (managed by the Sketch Engine team) users can upload, build, process, share and explore their own corpora.
Where a user already has a corpus, they can upload it and install it, via a simple web interface. The source documents can be in any of the common formats (doc, html, pdf, txt, tmx) and may also be compressed and/or archived (.zip, .gz, .bz2, .tar). All of these formats are then converted to plain text (.txt). If the data are already annotated (perhaps with part-of-speech tags, or lemmas, or for discourse function, etc.) then it needs to be in the Sketch Engine’s input format, ‘vertical’ text, as documented in the Sketch Engine help pages. The user can then manage their own corpora, including adding more data, deleting, and processing (see below) as well as using them for their research via the core Sketch Engine functions (as in Sect. 2).
BootCaT (Baroni and Bernardini 2004) is a procedure for building a corpus, starting from a set of ‘seed words’, by making tuples (typically triples) of the seed words, sending each tuple as a query to a search engine, and then gathering the web pages that the search engine finds. When applied to a specialist domain, with seed words from that specialist domain, it turns out to be a remarkably efficient way of discovering the terminology and phraseology of that domain. The Sketch Engine includes an implementation called WebBootCaT. See Fig. 18 for the WebBootCaT form and Fig. 19 for the keywords and terms found, fully automatically and in a few minutes, for the vulcanology domain.
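The first step of the procedure, turning seed words into tuples that serve as search-engine queries, can be sketched as below. The seed words, function names, tuple count and random sampling are illustrative assumptions; the subsequent steps (sending the queries to a search engine and downloading and cleaning the pages found) are deliberately left out, since they depend on an external service.

```python
import random
from itertools import combinations


def seed_tuples(seeds, size=3, n_tuples=10, rng_seed=0):
    """Sample `n_tuples` distinct combinations of `size` seed words."""
    all_tuples = list(combinations(sorted(set(seeds)), size))
    rng = random.Random(rng_seed)  # fixed seed so that the sketch is repeatable
    rng.shuffle(all_tuples)
    return all_tuples[:n_tuples]


def as_queries(tuples):
    """Render each tuple as a space-separated query string."""
    return [" ".join(t) for t in tuples]


# Hypothetical seed words for the vulcanology domain.
seeds = ["volcano", "lava", "magma", "eruption", "crater"]

for query in as_queries(seed_tuples(seeds, n_tuples=4)):
    print(query)
```

Using triples rather than single words keeps each query specific to the domain: a page that contains three of the seed words together is likely to be on topic.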
More can be done with a user corpus if it is accurately tokenised, lemmatised and part-of-speech tagged. Tools for these processes are language-specific. For 11 major world languages, tools have been identified, licensed (where necessary) and installed. Some users have used the Sketch Engine explicitly for this service: they can upload plain text, get it processed, and then download the processed data (in vertical format), perhaps for further annotation and re-uploading.
Processing options are steadily being added for more languages.
There is ‘access control’ for user corpora. By default, the only user who can see a corpus is the person who created it, but they can give access to others (at a web interface). This allows a teacher to give access to a corpus that they have prepared, to their students, or a researcher to share their corpus with their colleagues.
At ten years old, the Sketch Engine is mature software. There has been a steady stream of new functionality as well as bug-fixes and improvements to usability. Recent usability improvements include
a ‘breadcrumb trail’ to show a user how they got to the concordance they are looking at, which might be the results of an original search plus sorting, filtering and sampling. This was a response to user feedback that it was easy to lose track of what a particular concordance was
‘more data’ and ‘less data’ buttons for word sketches. The number of collocates shown in a word sketch is defined by three parameters: a frequency threshold, a salience threshold and a limit to the number of collocates per list. Users sometimes want to see more collocates than they have been shown in the first instance, and sometimes they feel overwhelmed and want to see fewer. But if they start considering the three parameters, they are bamboozled. Hence the ‘more data’ and ‘less data’ buttons
thesaurus word clouds: see Fig. 10.
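The interplay of the three word sketch parameters, and the way a single button can hide them, might be sketched as follows. Everything specific here is hypothetical: the default values, the halving and doubling steps, the names and the candidate data are invented for illustration, and the text above does not say how the buttons actually adjust the parameters.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchParams:
    min_freq: int = 5          # frequency threshold
    min_salience: float = 2.0  # salience threshold
    max_items: int = 3         # limit on collocates per list


def select_collocates(candidates, params):
    """candidates: (collocate, frequency, salience) triples; returns collocate names."""
    kept = [c for c in candidates
            if c[1] >= params.min_freq and c[2] >= params.min_salience]
    kept.sort(key=lambda c: c[2], reverse=True)  # most salient first
    return [c[0] for c in kept[:params.max_items]]


def more_data(p):
    """Relax all three parameters in one step (assumed behaviour)."""
    return replace(p, min_freq=max(1, p.min_freq // 2),
                   min_salience=p.min_salience - 1.0, max_items=p.max_items * 2)


def less_data(p):
    """Tighten all three parameters in one step (assumed behaviour)."""
    return replace(p, min_freq=p.min_freq * 2,
                   min_salience=p.min_salience + 1.0, max_items=max(1, p.max_items // 2))


# Invented object collocates of catch, with made-up frequencies and salience scores.
candidates = [("glimpse", 120, 9.1), ("fish", 300, 7.4), ("bus", 80, 6.0),
              ("attention", 60, 5.2), ("cold", 4, 4.8), ("ball", 40, 1.5)]

default = SketchParams()
print(select_collocates(candidates, default))
print(select_collocates(candidates, more_data(default)))
print(select_collocates(candidates, less_data(default)))
```

The user sees only a longer or shorter list; the three thresholds move together behind the scenes.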
Other additions to functionality, which complement the core functions described in Sect. 2 and the preloaded and user corpora as described in Sect. 5, include the API, GDEX, bilingual sketches, keywords and ‘comparing corpora’, terminology, and localisation.
A simple JSON API allows other programs to access word sketches, collocations and thesaurus entries, and to find the terminology in a document.
Dictionary users like examples. This is a clear finding of dictionary user research (Frankenberg Garcia 2014). Where the dictionary is to be published on paper, not many examples can be offered owing to space limitations. With electronic dictionaries, that constraint disappears. The constraint becomes, rather, the editorial time needed to prepare them. There are already compelling linguistic reasons for taking examples from corpora rather than inventing them (Hanks 2012): could the corpus software not merely find the examples for a word, but automatically find the good ones, for using as dictionary examples?
A Good Dictionary Examples (GDEX) function was added to the Sketch Engine in 2008 and has had many enthusiastic users. It was originally applied to English (Kilgarriff et al. 2008) and has since been used for a number of languages including Slovene (Kosem et al. 2011). It works by sorting a concordance, so the corpus lines judged best by the algorithm are shown first. Then the lexicographer should not have to read many of them before finding a good one. The same core technique has also been used to score documents and to exclude low-scoring documents from a corpus entirely.
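The principle of sorting a concordance by a ‘goodness’ score can be illustrated with a toy scorer. The features used below (sentence length, looking like a complete sentence, share of uncommon words), their weights and the word list are assumptions chosen for the example; they are not the features or weights of the GDEX function itself.

```python
def example_score(sentence, common_words, min_len=10, max_len=25):
    """Toy score in [0, 1]: higher means a more promising dictionary example."""
    tokens = sentence.split()
    score = 1.0
    # Penalise sentences that are very short or very long.
    if not (min_len <= len(tokens) <= max_len):
        score -= 0.4
    # Penalise fragments: no initial capital or no final punctuation.
    if not sentence[:1].isupper() or not sentence.rstrip().endswith((".", "!", "?")):
        score -= 0.3
    # Penalise a high share of words outside the common-word list.
    words = [t.strip(".,;:!?\"'").lower() for t in tokens]
    words = [w for w in words if w]
    if words:
        rare_share = sum(1 for w in words if w not in common_words) / len(words)
        score -= 0.3 * rare_share
    return max(score, 0.0)


def sort_concordance(lines, common_words):
    """Sort concordance lines so that the highest-scoring ones come first."""
    return sorted(lines, key=lambda s: example_score(s, common_words), reverse=True)


# Invented word list and concordance lines.
common = {"the", "a", "in", "this", "large", "river", "morning", "caught", "fisherman"}
good = "The fisherman caught a large trout in the river this morning."
fragment = "caught trout lol"

print(sort_concordance([fragment, good], common)[0])
```

The lexicographer then reads from the top of the sorted list and, ideally, finds a usable example within the first few lines.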
The critical outstanding question, for dictionary publishers, is this: can GDEX work well enough that example sentences can be added to dictionary entries without an editor needing to check them first? This is a goal, but a number of obstacles stand in the way. First, the corpus needs to be very big, to provide plenty of examples for the algorithm to choose amongst, and usually the only way to get a very large corpus is from the web. But web corpora contain web spam, which sometimes makes it past all other filters and makes bad dictionary examples.
Second: parsnips. Parsnips is an acronym for the potentially offensive topics that teaching materials might be wise to avoid, since they will be seen across the globe by all communities and cultures. It stands for Politics, Alcohol, Religion, Sex, Narcotics, Isms, Pork (as a stand-in for the various foods that are taboo in various cultures). A second current challenge is to scrub the data clean of parsnips.
Where monolingual lexicographers appreciate monolingual word sketches, bilingual lexicographers would like bilingual ones. They have recently been developed (Baisa et al. 2014) and are currently being rolled out for more language pairs.
Keywords and corpus comparison
Where there are a number of corpora available for a language, the question arises, “how do they compare?” This has been the central research question for the first author for some years (Kilgarriff 2001, 2012) and the Sketch Engine supports a range of comparisons, quantitative and qualitative, between any pair of same-language corpora: see Kilgarriff (2012) for English and Czech, Kilgarriff and Renau (2013) for Spanish.
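One quantitative comparison is a keyword list: the words that are most distinctive of one corpus when set against another. The text here does not name the statistic, so the smoothed ratio of per-million frequencies used below is an assumption of this sketch, as are the function names and the toy data.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyword_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank words of the focus corpus by a smoothed ratio of normalised frequencies."""
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    n_focus, n_reference = len(focus_tokens), len(reference_tokens)
    scores = {}
    for word, count in focus.items():
        f = per_million(count, n_focus)
        r = per_million(reference.get(word, 0), n_reference)
        # The smoothing constant avoids division by zero for unseen words.
        scores[word] = (f + smoothing) / (r + smoothing)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


focus_corpus = "the corpus the corpus query tool".split()
reference_corpus = "the cat sat on the mat the end".split()
ranked_keywords = keyword_scores(focus_corpus, reference_corpus)
```

Here `corpus` comes out on top because it is frequent in the focus corpus and absent from the reference, while `the`, which is common in both, falls to the bottom. A larger smoothing constant shifts the list towards more frequent words.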
To find the terminology of a domain, in a language, the requirements are
a domain corpus
a reference corpus
a grammar for terms
a lemmatiser, part-of-speech tagger, and parser (to find linguistic units with the grammatical shape that makes it possible for them to be terms)
a statistic (to identify the term candidates that are most distinctive of the domain in contrast to the reference material).
The Sketch Engine has most of these pieces in place. Users can upload their domain corpus, or build one using WebBootCaT. Reference corpora are available for 60 languages. There are already grammars for the word sketches, which can be adapted to provide term grammars. The parsing machinery is in place, and, as discussed above, for a growing number of languages, language-specific processing tools are installed and ready to use. The statistic used to identify keywords is also suitable for identifying terms.
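The role of the term grammar can be sketched with a very small example: given lemmatised, part-of-speech-tagged text, collect the sequences whose grammatical shape allows them to be terms. The grammar used here (any number of adjectives followed by one or more nouns), the tag names `ADJ` and `NOUN`, and the function name are assumptions for illustration, not the grammars shipped with the Sketch Engine.

```python
from collections import Counter


def term_candidates(tagged_tokens, min_len=2):
    """Count maximal ADJ* NOUN+ sequences in a list of (lemma, tag) pairs."""
    counts = Counter()
    span, has_noun = [], False
    # A sentinel token forces the final span to be closed.
    for lemma, tag in list(tagged_tokens) + [("", "END")]:
        if tag == "NOUN":
            span.append(lemma)
            has_noun = True
        elif tag == "ADJ" and not has_noun:
            span.append(lemma)
        else:
            # Keep the span only if it contains a noun and is long enough.
            if has_noun and len(span) >= min_len:
                counts[" ".join(span)] += 1
            # An adjective after a noun starts a new span; anything else resets.
            span = [lemma] if tag == "ADJ" else []
            has_noun = False
    return counts


tagged = [
    ("the", "DET"), ("parallel", "ADJ"), ("corpus", "NOUN"), ("contain", "VERB"),
    ("word", "NOUN"), ("sketch", "NOUN"), ("and", "CONJ"),
    ("large", "ADJ"), ("parallel", "ADJ"), ("corpus", "NOUN"), ("grow", "VERB"),
]
candidates = term_candidates(tagged)
```

The candidates found in the domain corpus would then be counted in the reference corpus as well, and ranked with the chosen statistic so that the most domain-specific ones come first.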
Term-finding functionality is now available in the Sketch Engine, as illustrated in Fig. 19, for ten languages.
Machinery to support the localisation of the interface has been added. Currently, the interface can be seen in Czech, Chinese, English, Irish, Slovene, and Croatian.
The Sketch Engine is both a corpus query tool and a web service; the web service includes corpus building and management. We take each in turn.
Corpus query tools
Software tools for corpus exploration fall into two categories: those designed for installation on each computer where they are used, and those designed for installation on a server.
Widely used tools for local installation include, in order of their invention, Monoconc/Paraconc (since 1995), WordSmith (since 1996), Antconc (Anthony 2004; since 2004) and Concgram (Greaves 2009; since 2005). Antconc is free, and the other three are commercial products. All have many enthusiastic users. All can be used over a network, but this is not their normal mode.
Amongst tools for use over a network, the IMS Corpus Workbench has pride of place. IMS is the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart, where the tool was developed in the early 1990s (Christ and Schulze 1994). (It is also often referred to as CWB, Stuttgart tools, or CQP, for its “Corpus Query Processor”.) It has been widely used and has a community of developers working with it. The original version was pre-web, and the envisaged network was within a university. A central question since then has been how it can be made to work well on the web. The usual solution has been that it provides a back end, and then a number of front ends have been prepared. CQPWeb (Hardie 2012), for example, combines the IMS Corpus Workbench back end with a MySQL database.
The Stuttgart tools and CQPWeb are both free and open-source, and the community of developers for corpus software has a strong commitment to open source. While the Sketch Engine is not open source, as this could undermine its viability as a business, a version of it, NoSketchEngine, is open source.Footnote 26
The functionality of all of these tools (and most of those covered below) comprises a concordancer, plus various ways to manipulate concordances, plus a range of summary reports. There is little disagreement about the value of the various reports, and the functionality differences lie rather in how much time and motivation the developers have had to develop more functions. As one of the more mature tools, working commercially with a support and development team of seven, the Sketch Engine has more functions than most.
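The three ingredients named above (a concordancer, a way to manipulate the concordance, and a summary report) can be shown in miniature. This is a minimal sketch over an in-memory token list, with hypothetical function names; real tools work from indexed corpora and offer far richer queries.

```python
from collections import Counter


def kwic(tokens, keyword, width=4):
    """Return (left context, node, right context) for each hit of the keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def sort_by_right_context(hits):
    """Manipulate the concordance: order the lines by what follows the node."""
    return sorted(hits, key=lambda hit: hit[2].lower())


def left_neighbour_counts(hits):
    """Summary report: frequency of the word immediately left of the node."""
    return Counter(hit[0].split()[-1].lower() for hit in hits if hit[0])


tokens = "the fisher king met a scallop fisher and a bottom fisher today".split()
hits = kwic(tokens, "fisher", width=2)
by_right = sort_by_right_context(hits)
report = left_neighbour_counts(hits)
```

Sorting by right or left context brings recurring patterns together on the screen, and the frequency report is a first, crude step towards a collocation listing.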
Corpus websites and services
There are a small number of corpus websites for multiple languages and a large number for a single language (and usually, a single corpus). We review those that cover multiple languages in some detail below. We do not cover the single-language ones: there are too many, many of which are short-lived. It is de rigueur for any corpus project to make its corpus available over the web, and this is typically done on a dedicated website, sometimes using the Stuttgart tools as the back end and sometimes using software developed as part of the project. Such projects are often national projects, and one advantage is often that the interface is in the same language that the corpus is a corpus of. For scholars of that language who may not be at ease in English, this may be a major advantage. A disadvantage of developing software afresh is that the software will be new: it is likely to be less robust, with less functionality, than mature systems. The funding will end and then it will be hard to maintain. Growing numbers of corpus developers are taking the route of making their corpus available in the Sketch Engine.Footnote 27
Corpus websites for multiple languages
Mark Davies’s website at Brigham Young UniversityFootnote 28 offers corpora for English, Spanish and Portuguese. The resources for English are outstanding, supporting the exploration of the behaviour of words and phrases across time, genre, and regional varieties (Davies 2009). The system is fast and reliable.
Uwe Quasthoff and colleagues in Leipzig have crawled the web for corpora of 229 languages and made them searchable at their Wortschatz website (Quasthoff et al. 2006).Footnote 29 The website is in German. Within Germany, this is a very widely used site: it serves as a main reference for language questions from laypeople.
Eckhard Bick has focussed on syntax and parsing. The Visual Interactive Syntax Learning websiteFootnote 30 has corpora which are often modest in size but are parsed. The website has games and quizzes to support language education, as well as Deepdict, comprising word-sketch-like reports, for nine languages (Bick 2009).
The OPUS project (Tiedemann and Nygaard 2004) has gathered parallel corpora and organised them so they are both searchable on the website (with the Stuttgart tools as the back end) and also downloadable, for use in other research and software (including the Sketch Engine; many of the parallel corpora in the Sketch Engine were taken from the OPUS site).Footnote 31
All of these sites are free to use. This is in contrast to the Sketch Engine. Most are based in universities and are supported via research grants and academic salaries. While, naturally, most people would rather not pay (other than via their taxes), the commercial model has advantages. There is an income stream to support the maintenance and development of the software and web service for the long term, and customers with particular requirements can get what they need by paying for it.
Google and other search engines do a similar job to a corpus website: they allow the user to find many instances of a word, in context, as a dataset for further study, and they do it fast. Where the user knows of no corpus for the language, or the item they are searching for is rare so not enough data are available via dedicated corpus linguistic tools, Google may be the best tool to use. For a discussion of the use of search engines for corpus research see Kilgarriff (2007).
A possibility lying between the search engine and a corpus tool is the metasearch engine, in which a corpus tool takes a user’s query, passes it on to Google or another search engine, receives the results, and filters and displays them in ways that are useful for language researchers. The best known tool of this kind is Webcorp (Renouf et al. 2006).Footnote 33
Other corpus-like websites, mentioned here for completeness, are:
Wikipedia: the Wikipedia for a language is a convenient corpus for that language, as used, for example, as a starter corpus in Kilgarriff et al. (2010).
Project GutenbergFootnote 34
Google booksFootnote 35
Tools for corpus building and annotation
The BootCaT procedure is described above. There are a number of implementations, including one from the University of Bologna group where the idea was originally developed.Footnote 40
Several groups have developed pipelines for web corpus building. The steps are
‘cleaning’ to remove non-text material
linguistic processing (tokenisation, possibly also lemmatisation, part-of-speech tagging, parsing).
The pipeline used by the Sketch Engine team uses three tools which were developed within the group and have now been published as open-source software: spiderling (Suchomel and Pomikálek 2012) for crawling, onion for deduplication, justext for cleaning (both Pomikálek 2011). Other pipelines, with similar philosophy and components, have been developed in Bologna (Baroni et al. 2009), Leipzig (Biemann et al. 2004) and Berlin (Schäfer and Bildhauer 2013).
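The shape of such a pipeline can be sketched with standard-library stand-ins for each stage. None of this is the code of spiderling, onion or justext: crawling is assumed to have happened already (the input is a list of downloaded pages), cleaning simply keeps the text of `<p>` elements, deduplication catches only exact repeats after whitespace and case normalisation, and the linguistic processing is a naive tokeniser. All function and class names are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Cleaning stand-in: keep only the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def clean(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def deduplicate(paragraphs):
    """Deduplication stand-in: drop exact repeats, ignoring case."""
    seen, unique = set(), []
    for paragraph in paragraphs:
        key = hashlib.sha1(paragraph.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique


def tokenise(paragraph):
    """Linguistic-processing stand-in: split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", paragraph)


def build_corpus(pages):
    paragraphs = [p for page in pages for p in clean(page)]
    return [tokenise(p) for p in deduplicate(paragraphs)]


pages = [
    "<html><script>var x=1;</script><p>Corpora for all!</p><div>menu</div></html>",
    "<p>Corpora  for all!</p><p>A second paragraph.</p>",
]
corpus = build_corpus(pages)
```

Production tools differ mainly in how much smarter each stage is: boilerplate detection rather than tag matching, near-duplicate detection rather than exact hashing, and language-specific taggers and lemmatisers rather than a regular expression.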
Annotating a corpus with human input (as distinct from a fully automatic process) is supported in a limited way in the Sketch Engine, via facilities developed for Hanks’s Corpus Pattern Analysis (Hanks 2008).Footnote 41 Many tools have been developed specifically for manual and semi-automatic corpus annotation, leading examples being the UAM tool (O’Donnell 2008) and the Groningen Meaning Bank tool (Basile et al. 2012).Footnote 42
The Sketch Engine is a leading corpus tool (both in the sense of ‘corpus query tool’ and in the sense of ‘corpus web service’). It is now 10 years old: a 10-year period that has seen revolutions in connectivity, devices, and dictionary publishing, and the worldwide spread of corpus methods in dictionary-making. It is mature software offering a wide range of functions, with the web service offering many corpora for many languages, as well as services for corpus building and maintenance.
In this paper we have described word sketches, concordancing, and the thesaurus (Sect. 2), the different kinds of user (Sect. 3), and approaches to working with many different languages (Sect. 4). Section 5 reviewed the kinds of corpora available in the Sketch Engine, including user corpora and the ways of building and working with them. Section 6 gave a brief tour of some of the innovations and new reports offered in the past few years. In Sect. 7 we reviewed related work.
As the strapline has it, ‘corpora for all!’
The change is often lamented. Rundell (2012) celebrates it.
The Oxford Arabic Dictionary is a bilingual Arabic-English, English-Arabic dictionary. For bilingual dictionaries, a corpus is most relevant for the analysis of the source language. In this dictionary the source side of the Arabic-English half is the first corpus-based, dictionary-scale analysis of the Arabic lexicon.
The examples in this section are all in English, as it is the only language that most readers of the journal share. Later sections will discuss and give examples for a range of Asian languages.
Fisher is a gender-neutral variant of fisherman (in addition to its uses in compounds such as scallop fisher, bottom fisher, the Fisher King, in the biblical fisher of men, and as a common English surname).
For languages which do not put spaces between words (Chinese, Japanese, Thai), segmentation into words is both a prerequisite for high-quality concordancing, and also a user interface challenge. In the Sketch Engine all corpora for these languages have been pre-processed with language-specific tools to segment the input string into words.
Corpus query language (CQL) is based on the formalism developed at University of Stuttgart in the 1990s (Christ and Schulze 1994) and widely used in the corpus linguistics community. The Sketch Engine version has been extended (Jakubíček et al. 2010) and is fully documented, with a tutorial, on the website.
Most databases are based on relational modelling and queried using Structured Query Language (SQL). However, for text, the sequence of words is the central fact, not the relations, so it is debatable whether SQL databases are suitable. The Sketch Engine is based on its own database management system called Manatee (Rychlý 2000, 2007) devised specifically for corpus linguistics. The web-based front end of Manatee is called Bonito and together with the Corpus Architect module (responsible for building and managing user corpora) these three are the core components of the Sketch Engine.
This was in 1998, for the preparation of the first edition of the Macmillan English Dictionary (Rundell 2002) in a process described in Kilgarriff and Rundell (2002). Thus word sketches are older than the Sketch Engine. The first versions of word sketches were standalone HTML files, one for each word. The integration with a full-function corpus query tool, Manatee/Bonito, to give The Sketch Engine, came later.
Dutch (also called Flemish) is an official language in both the Netherlands and Belgium, and the institute here (INL) is a joint one from both countries.
Much of the development work for the Sketch Engine was undertaken under a contract from Foras na Gaeilge (the official body for the Irish language) in preparation for the creation of a new English-Irish dictionary (http://www.focloir.ie). Irish is spoken in both the Irish Republic and Northern Ireland (which is part of the UK) and Foras na Gaeilge is a joint institute of both countries.
Much of this work takes place in universities, but much also takes place outside (e. g., in language schools) so we treat it as a separate type of use.
For Urdu, the relation between the language spoken in India (as a mother tongue) and in Pakistan (as an official language but not the mother tongue of many people) is a particular puzzle. For both Urdu and Punjabi, multiple writing systems are a challenge. For all Indian languages, until quite recently, many Indians would use the web in English: all educated Indians speak English, web searching worked better in English, and not so much could be found in many of the Indian languages. This is changing fast.
For all of the languages that have been considered except Chinese, where the concept does not apply.
The resources for Chinese have since been updated, and now the leading Chinese corpus in the Sketch Engine is zhtenTen, from the web, and is processed by Stanford tools. The collaboration with Prof. Huang continues.
The original release was for fewer languages. For recent releases see http://www.statmt.org/europarl/.
The Cambridge Learner Corpus, the largest learner corpus for English, is in the Sketch Engine and is used extensively by Cambridge University Press and researchers and textbook authors who publish with them. However it is not accessible to SkE users without a CUP affiliation.
See http://global.oup.com/uk/pressreleases/500words/. Access to the corpus is restricted to OUP staff and collaborators. Word lists based on the corpus may be made available; applications to OUP.
NoSketchEngine comprises current versions of Manatee and Bonito, but without word sketches and the functionalities built on them. See http://nlp.fi.muni.cz/trac/noske.
This is the history of most of the Sketch Engine interface localisations.
Ambati, B.R., S. Reddy, and A. Kilgarriff. 2012. Word sketches for Turkish. In Proc LREC, 2945–2950. Istanbul.
Anthony, L. 2004. AntConc: a learner and classroom friendly, multi-platform corpus analysis toolkit. In Proc IWLeL, 7–13.
Arts, T., ed. 2014. Oxford Arabic Dictionary. Oxford: Oxford University Press.
Arts, T., Y. Belinkov, N. Habash, A. Kilgarriff, and V. Suchomel. 2014 (forthcoming). arTenTen and word sketches for Arabic. Journal of King Saud University: Computing and Information Science. Special issue on Arabic natural language processing.
Baisa, V., M. Jakubíček, A. Kilgarriff, V. Kovář, and P. Rychlý. 2014. Bilingual word sketches: the translate button. In Proc EURALEX, Bolzano/Bozen.
Baisa, V., and V. Suchomel. 2012. Large corpora for Turkic languages and unsupervised morphological analysis. In Proc LREC, Istanbul.
Baroni, M., and S. Bernardini. 2004. BootCaT: Bootstrapping Corpora and Terms from the Web. In Proc LREC, Lisbon.
Baroni, M., S. Bernardini, A. Ferraresi, and E. Zanchetta. 2009. The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3): 209–226.
Basile, V., J. Bos, K. Evang, and N. Venhuizen. 2012. Developing a large semantically annotated corpus. In LREC vol. 12, 3196–3200.
Bick, E. 2009. DeepDict—a graphical corpus-based dictionary of word relations. In Proc NODALIDA, Vol. 4, 268–271.
Biemann, C., S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. In Computational linguistics and intelligent text processing, 217–228. Berlin Heidelberg: Springer.
Burnard, L. 1995. The BNC reference manual.
Christ, O., and M. Schulze. 1994. The IMS Corpus Workbench: Corpus Query Processor (CQP) User’s Manual. University of Stuttgart.
Chung, S.-F., and C.-R. Huang. 2010. Using collocations to establish the source domains of conceptual metaphors. Journal of Chinese Linguistics 38(2): 183–223.
Culpeper, J., and M. Kytö. 2010. Early Modern English dialogues: spoken interaction as writing. Cambridge: Cambridge University Press.
Davies, M. 2009. The 385+ million word Corpus of Contemporary American English (1990–2008+): design, architecture, and linguistic insights. International Journal of Corpus Linguistics 14(2): 159–190.
Frankenberg Garcia, A. 2014. The use of corpus examples for language comprehension and production. ReCALL.
Garrett, E., N.W. Hill, A. Kilgarriff, R. Vadlapudi, and A. Zadoks. 2014. The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries. In The Third International Conference on Tibetan Language, eds. Tuttle, Gya, Dare and Wilber, New York: Trace Foundation (forthcoming).
Greaves, C. 2009. ConcGram 1.0: a phraseological search engine. John Benjamins.
Hanks, P. 2008. Mapping meaning onto use: a Pattern Dictionary of English Verbs. In Proc AACL, Utah.
Hanks, P. 2012. The corpus revolution in lexicography. International Journal of Lexicography 25(4): 398–436.
Hardie, A. 2012. CQPweb—combining power, flexibility and usability in a corpus analysis tool. International journal of corpus linguistics 17(3): 380–409.
Hà, P.T., N.T.M. Huyền, L.H. Phương, and A. Kilgarriff. 2012. Nghiên cứu từ vựng tiếng Việt với hệ thống sketch engine. Tạp chí Tin học và Điều khiển học 27(3): 206–218.
Huang, C.-R., K.-J. Chen, and Q.-X. Lai. 1997. Mandarin Daily Dictionary of Chinese Classifiers. Taipei: Mandarin Daily Press.
Huang, C.-R., J.-F. Hong, W.-Y. Ma, and P. Šimon. 2014. From corpus to grammar: automatic extraction of grammatical relations from annotated corpus. In T’sou and Kwong Eds. Linguistic Corpus and Corpus Linguistics in the Chinese Context. Journal of Chinese Linguistics Monograph. Hong Kong: Chinese University of Hong Kong Press, (forthcoming).
Huang, C-R., A. Kilgarriff, Y. Wu, C.M. Chiu, S. Smith, P. Rychly, M.H. Bai, and K.-J. Chen. 2005. Chinese Sketch Engine and the extraction of grammatical collocations. In Proc Fourth SIGHAN Workshop on Chinese Language Processing, 48–55.
Jakubíček, M., A. Kilgarriff, D. McCarthy, and P. Rychlý. 2010. Fast syntactic searching in very large corpora for many languages. In Proc PACLIC, Vol. 24, 741–747, Japan.
Jakubíček, M., A. Kilgarriff, V. Kovář, P. Rychlý, and V. Suchomel. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics. Lancaster.
Kerswill, P., J. Cheshire, S. Fox, and E. Torgersen. 2013. English as a contact language: the role of children and adolescents. In English as a Contact Language, 258.
Kilgarriff, A. 2001. Comparing corpora. International Journal of Corpus Linguistics 6(1): 97–133.
Kilgarriff, A. 2007. Googleology is bad science. Computational linguistics 33(1): 147–151.
Kilgarriff, A. 2012. Getting to know your corpus. In Text, Speech and Dialogue, 3–15. Berlin Heidelberg: Springer.
Kilgarriff, A. 2013. Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proc ASLIB 35th Translating and the Computer Conference, London.
Kilgarriff, A., and M. Rundell. 2002. Lexical Profiling Software and its lexicographic applications: a case study. In Proc EURALEX. Copenhagen, Denmark.
Kilgarriff, A., P. Rychlý, P. Smrz, and D. Tugwell. 2004. The Sketch Engine. In Proc Eleventh EURALEX International Congress. Lorient, France.
Kilgarriff, A., C.R. Huang, P. Rychlý, S. Smith, and D. Tugwell. 2005. Chinese word sketches. In Proc ASIALEX 2005: Words in Asian cultural context. Singapore.
Kilgarriff, A., M. Husák, K. McAdam, M. Rundell, and P. Rychlý. 2008. GDEX: automatically finding good dictionary examples in a corpus. In Proc EURALEX. Barcelona.
Kilgarriff, A., and I. Renau. 2013. esTenTen, a Vast Web Corpus of Peninsular and American Spanish. Procedia Social and Behavioral Sciences 95: 12–19.
Koehn, P. 2005. Europarl: a parallel corpus for statistical machine translation. Proc MT summit 5: 79–86.
Kosem, I., M. Husak, and D. McCarthy. 2011. GDEX for Slovene. In Proceedings of eLex, 151–159. Bled, Slovenia.
Kosem, I., V. Baisa, V. Kovář, and A. Kilgarriff. 2013. User-friendly interface of error/correction-annotated corpus for both teachers and researchers. Solstrand: Proc Learner Corpus Research.
McGillivray, B., and A. Kilgarriff. 2013. Tools for historical corpus research, and a corpus of Latin. In New Methods in Historical Corpora, Bennett, P.D. ed. Vol 3. BoD–books on demand.
O’Donnell, M. 2008. Demonstration of the UAM CorpusTool for text and image annotation. In Proc 46th ACL: Demo Session, 13–16. Association for computational linguistics.
Pomikálek, J. 2011. Removing boilerplate and duplicate content from Web Corpora. PhD thesis, Masaryk University, Brno, Czech Republic.
Quasthoff, U., M. Richter, and C. Biemann. 2006. Corpus portal for search in monolingual corpora. In Proc LREC, 1799–1802. Genoa, Italy.
Renouf, A., A. Kehoe, and J. Banerjee. 2006. WebCorp: an integrated system for web text search. Language and Computers 59(1): 47–67.
Rundell, M. ed. 2002. Macmillan English Dictionary for Advanced Learners. Macmillan.
Rundell, M. 2012. Stop the presses—the end of the printed dictionary. Macmillan Dictionary Blog, 5 Nov. http://www.macmillandictionaryblog.com/bye-print-dictionary.
Rychlý, P. 2000. Korpusové manažery a jejich efektivní implementace. PhD Thesis, Masaryk University, Brno, Czech Republic.
Rychlý, P. 2007. Manatee/bonito–a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing, 65–70. Masaryk University, Brno, Czech Republic.
Sanseido. 2003, 2007. The WISDOM English–Japanese Dictionary. Sanseido.
Schäfer, R., and F. Bildhauer. 2013. Web corpus construction. Synthesis Lectures on Human Language Technologies 6(4): 1–145.
Scheible, S., R.J. Whitt, M. Durrell, and P. Bennett. 2011. A gold standard corpus of Early Modern German. In Proc 5th Linguistic Annotation Workshop, 124–128. Association for computational linguistics.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In WaCky! Working papers on the Web as Corpus, eds. Baroni and Bernardini, 63–98. Bologna: Gedit.
Srdanovic Erjavec, I., T. Erjavec, and A. Kilgarriff. 2008. A web corpus and word sketches for Japanese. Information and Media Technologies 3(3).
Suchomel, V., and J. Pomikálek. 2012. Efficient Web crawling for large text corpora. In Proc Seventh Web as Corpus Workshop (WAC7), 39–43. Lyon, France.
Thomas, J. 2014. Discovering English with the Sketch Engine. Print-on-demand. http://ske.li/deske.
Tiedemann, J., and L. Nygaard. 2004. The OPUS Corpus—parallel and free. Lisbon: Proc LREC.
Wild, K., A. Kilgarriff, and D. Tugwell. 2013. The Oxford Children’s Corpus: using a Children’s Corpus in Lexicography. International Journal of Lexicography 26(2): 190–218.
This work has been partly supported by the Ministry of the Interior of the Czech Republic, in project VF20102014003.
Communicated by Yukio Tono.
Kilgarriff, A., Baisa, V., Bušta, J. et al. The Sketch Engine: ten years on. Lexicography ASIALEX 1, 7–36 (2014). https://doi.org/10.1007/s40607-014-0009-9
- Corpus lexicography
- Corpus tools
- Word sketches
- Sketch Engine