1 Introduction

The Sketch Engine is a leading corpus tool. It has been widely used in lexicography. It is now 10 years since its launch (Kilgarriff et al. 2004).

Those 10 years have seen dramatic changes. They have seen the near-death of dictionaries on paper, at the hands of electronic dictionaries.Footnote 1 They have seen the emergence of entire new ecosystems of dictionaries on the web, with many new players (Google, weblio.jp, dictionary.com, Leo, Wordnik.com). Previously, the dominant players had been around for decades, even centuries: Longman (who published Johnson’s dictionary in 1755), Kenkyusha, OUP, Le Robert, Duden, Merriam-Webster.

In the world at large, we have seen the invention and world takeover of the smartphone. 1994–2004 saw the switch of most dictionary lookups from paper to electronic: 2004–2014 has seen them nearly all (in percentage terms) switch from computer to phone. (Just think how often your students look up words on their phones, versus how often they look them up in any other way.) Dictionaries are far, far more available and accessible than they were. The sheer number of dictionary lookups has risen many times over (even as—bitter irony—many dictionary companies have seen their income collapse).

This is all at the publishing end of the dictionary business. What about the lexicography end? Here, we have seen the corpus revolution (Hanks 2012). It started in Northern Europe in the 1980s and 1990s, and has been spreading. For Chinese, a first thoroughly corpus-based dictionary was probably Huang et al. (1997)’s classifier-noun collocation dictionary. For Arabic, it is Oxford University Press’s Oxford Arabic Dictionary (Arts 2014),Footnote 2 though this was not produced in Asia. In Japan, corpus lexicography started in bilingual dictionary projects such as the WISDOM English-Japanese Dictionary (Sanseido 2003, 2007), but a truly corpus-based monolingual dictionary of Japanese is yet to appear.

Thus the 10 years of the Sketch Engine have also been the 10 years of bringing corpora into Asian lexicography. The paper is a perspective on those changes.

In this paper we review

  • the tool

  • its users

  • the languages covered

  • the corpora accessible in it, and

  • developments in the software over the past decade.

We finish by reviewing related work: other corpora, corpus websites and corpus tools as available for lexicography and corpus linguistics.

‘Sketch Engine’ refers to two different things: the software, and the web service. The web service includes, as well as the core software, a large number of corpora pre-loaded and ‘ready for use’, and tools for creating, installing and managing your own corpora. The paper covers both, with Sects. 2 and 6 focussing on the software, and Sects. 3, 4 and 5 on the web service.

2 The Sketch Engine software: core functions

2.1 The word sketch

The function that gives the Sketch Engine its name is the word sketch: a one-page summary of a word’s grammatical and collocational behaviour (Fig. 1).Footnote 3

Fig. 1 Word sketch for English catch, verb (from corpus enTenTen12)

This is a feast of information on the word. For catch (verb) just looking at the first column (objects of the verb) we immediately see a number of meanings, idioms and set phrases. We catch a glimpse of or catch sight of something. Fisherman, fishersFootnote 4 and anglers (column 2) catch fish, trout and bass. You often want to catch someone’s attention. You sometimes catch your breath and things sometimes catch your eye. Sportsmen and women, in a range of sports, catch passes and balls. Things catch fire. We all sometimes catch buses.

The ‘object’ column is noise-free, and all items on it are immediately interpretable by a native speaker. The second column, for subject, introduces a couple of complications. Surprise relates to the expression caught by surprise. Eye and breath are objects misanalysed as subjects. Touchdown catches is a term from American football: the word sketch succeeds in bringing it to our attention, though catches is a noun which has been misanalysed as a verb. Police introduces a new meaning of the verb (police catch criminals) and Anyone brings to our attention the related pattern Anyone caught [doing X] will be [punished].

The third column, and/or, tells us more about the police and sports meanings. Overheat goes with catch fire. Tangle and snag introduce a new meaning where, if a rope or line or piece of cotton or string or wire catches with something else, it no longer runs free.

The fourth table brings our attention to the phrasal verbs catch up, catch on, catch out; the fifth, to the reflexive use (I caught myself wondering…). The next set of tables show us what we might be caught in (the crossfire, a trap, the headlights), on (videotape, CCTV), by and with (your pants down). The final column takes us back to the police, with people being caught red-handed and unprepared.

The word sketch can be seen as a draft dictionary entry. The system has worked its way through the corpus to find all the recurring patterns for the word and has organised them, ready for the lexicographer to edit, elucidate, and publish. This is how word sketches have been used since they were first produced.

2.2 Concordance

When looking at a word sketch, a user often wants to find out more: where and how, for example, was catch used with the preposition with and the noun pant? They can do this by clicking on the number, and seeing the concordance, as in Fig. 2.

Fig. 2 Concordance for caught with pants

This is usually enough to show why a collocate occurred in a word sketch.

The concordance is the basic tool for anyone working with a corpus. It shows you what is in your corpus. It takes you to the raw data, underlying any analysis. Getting there from a word sketch is just one of the ways of getting to a concordance. The basic method is from the simple search form, as in Fig. 3.

Fig. 3 Simple search form

This is modelled on the Google search form. Users like a simple input form, where they put in what they are looking for, and the tool finds it for them. It is for the tool to do its best to understand what it was that the user wanted, and to find it for them. In the case of the Sketch Engine, simple searches are interpreted as follows:

  • case-insensitive (so a search for catch finds catch, Catch and CATCH)

  • as searches for either word form or lemma (so a search for catch finds catch, catching, catches, caught, and a search for caught finds just caught), and

  • where there is more than one item (with space as separator), a sequence.Footnote 5

These three aspects combine, so a simple search for catch fire finds all the hits in Fig. 4.

Fig. 4 Search hits for simple search catch fire
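
The three interpretation rules for simple searches can be illustrated with a short sketch. This is not the Sketch Engine's own code: the function names, the toy sentence, and the attribute names `lc` and `lemma_lc` in the generated query string are assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, lemma)


def to_cql(query: str) -> str:
    """Render a simple query as a CQL-style string.

    The attribute names `lc` (lowercased word form) and `lemma_lc`
    (lowercased lemma) are assumptions for illustration.
    """
    items = query.lower().split()
    return " ".join(f'[lc="{w}" | lemma_lc="{w}"]' for w in items)


def find_hits(query: str, tokens: List[Token]) -> List[int]:
    """Return start positions where the query matches a token sequence.

    Each query item matches a token if it equals the token's word form
    or its lemma, ignoring case; several items must match in sequence.
    """
    items = query.lower().split()
    if not items:
        return []
    hits = []
    for i in range(len(tokens) - len(items) + 1):
        window = tokens[i:i + len(items)]
        if all(q in (w.lower(), l.lower()) for q, (w, l) in zip(items, window)):
            hits.append(i)
    return hits


# Toy, hand-lemmatised sentence (hypothetical data).
sentence = [
    ("The", "the"), ("curtains", "curtain"), ("Caught", "catch"),
    ("fire", "fire"), ("after", "after"), ("catching", "catch"),
    ("fires", "fire"),
]

print(to_cql("Catch fire"))
# [lc="catch" | lemma_lc="catch"] [lc="fire" | lemma_lc="fire"]
print(find_hits("catch fire", sentence))  # [2, 5]
print(find_hits("caught", sentence))      # [2]
```

In this sketch a search for catch matches through the lemma, so it finds both Caught fire and catching fires, while a search for caught matches only the word form.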

Users often want more control than the simple search offers. By clicking on ‘Query types’ they see the options as in Fig. 5, and can specify a lemma (with optional word class, e.g. verb, noun, adjective) or a specific phrase or word form (with an option to match for case). ‘Character search’ is designed for languages which do not put spaces between words (Chinese, Japanese, Thai) so users can see a concordance for a character (without having to guess how the text has been segmented into words). CQL is the underlying corpus query language, which technically inclined users can input directly in the CQL box.Footnote 6 Other query types are automatically transformed into CQL queries which are then evaluated by the underlying database engine to obtain the results from the corpus.Footnote 7

Fig. 5 Query types

Lexicographers often want to home in on a particular pattern of use to explore it further. This can be done with the Context options, as in Fig. 6.

Fig. 6 Context filters

To find all instances of catch that have pant or pants within five words, we search for catch and enter pant as a lemma filter; the results are shown in Fig. 7.

Fig. 7 Concordance for catch with context filter pant (lemma) within five words left or right
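
A context filter of this kind can be sketched as follows. This is a minimal illustration over a list of lemmas, not the Sketch Engine's implementation; the function name `context_filter` and the toy sentence are assumptions made for the example.

```python
from typing import List


def context_filter(lemmas: List[str], node: str, context: str,
                   window: int = 5) -> List[int]:
    """Return positions of `node` that have `context` within `window`
    tokens to the left or to the right."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != node:
            continue
        left = lemmas[max(0, i - window):i]
        right = lemmas[i + 1:i + 1 + window]
        if context in left or context in right:
            hits.append(i)
    return hits


# Lemmatised form of "He was caught with his pants down" (hypothetical data).
lemmas = ["he", "be", "catch", "with", "his", "pant", "down"]

print(context_filter(lemmas, "catch", "pant"))            # [2]
print(context_filter(lemmas, "catch", "pant", window=2))  # []
```

Because the filter works on lemmas, a single filter term pant covers both pant and pants in the text.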

Some corpora have the documents within them classified for text type, for example, the “Brown Family” corpus, comprising the original Brown corpus (American English, 1961) and its various clones (for British or American English and at various date points), all with the same structure and genre distribution. Clicking on Text Type in the concordance form (Fig. 5) shows the form in Fig. 8. The user can limit the search to a particular national variety, time, or genre, by ticking boxes.

Fig. 8 Text type options in the Brown Family corpus

Once the user has a concordance, there are many things that can be done with it. It can be sorted, sampled, filtered (for example by Context, or Text Type) or saved. A range of frequency analyses are available, including collocation reports and analysis by text types (where the corpus has text types defined). At the level of the individual hit, the user can click on the search term for more context (see Fig. 9), or on the item in the ‘reference’ column to see the metadata for the item.

Fig. 9 Concordance display showing how the user can see more context. It also shows the ‘Left Hand Menu’ with the range of options for exploring the concordance, and the reference column (in blue) with an identifier for the document that the corpus line came from: clicking on an item in this column will show the metadata for the item

2.3 Thesaurus

The Sketch Engine prepares a ‘distributional thesaurus’ for a corpus. This is a thesaurus created on the basis of common collocation. If two words have many collocates in common, they will appear in each other’s thesaurus entry. It works as follows: if we find instances of both drink tea and drink coffee, that is one small piece of evidence that tea and coffee are similar. We can say that they ‘share’ the collocate drink (verb), in the OBJECT-OF relation. In a very large computation, for all pairs of words, we compute how many collocates they share, and the ones that share most (after normalisation) are the ones that appear in a word’s thesaurus entry. Distributional thesauruses are a topic of great interest in computational linguistics, and show promise for addressing a range of challenges.

The thesaurus entry for tea (in both list and word-cloud form) is shown in Fig. 10.

Fig. 10 Thesaurus entry for tea. In the word cloud, the larger a word, the more similar it is to tea
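
The idea of counting shared collocates can be made concrete with a small sketch. It is an illustration only: the triples are invented, and the normalisation used here (Jaccard overlap of collocate sets) is an assumption chosen for simplicity, not the measure used in the Sketch Engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word, grammatical relation, collocate)


def build_contexts(triples: Iterable[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each word to the set of (relation, collocate) pairs it occurs with."""
    contexts: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for word, relation, collocate in triples:
        contexts[word].add((relation, collocate))
    return dict(contexts)


def thesaurus_entry(word: str, contexts: Dict[str, Set[Tuple[str, str]]],
                    top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank other words by the share of collocates they have in common with `word`."""
    target = contexts.get(word, set())
    scored = []
    for other, ctx in contexts.items():
        shared = len(target & ctx)
        if other == word or shared == 0:
            continue
        # Jaccard normalisation: shared collocates / all collocates of either word.
        scored.append((other, shared / len(target | ctx)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Invented triples for illustration.
triples = [
    ("tea", "object_of", "drink"), ("tea", "object_of", "brew"),
    ("tea", "modifier", "hot"),
    ("coffee", "object_of", "drink"), ("coffee", "object_of", "brew"),
    ("coffee", "modifier", "strong"),
    ("water", "object_of", "drink"), ("water", "modifier", "cold"),
    ("car", "object_of", "drive"),
]

print(thesaurus_entry("tea", build_contexts(triples)))
# [('coffee', 0.5), ('water', 0.25)]
```

Here tea and coffee share two collocates (drink and brew, both as OBJECT-OF), so coffee ranks above water, which shares only drink; car shares nothing and does not appear.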

3 Users and uses

3.1 Lexicography

The first Sketch Engine users were lexicographers, with Macmillan as the first user for the word sketches,Footnote 8 and Oxford University Press as the first for the Sketch Engine.

Lexicography, particularly for English and particularly in the 1980s and 90s, was the driving force in the development of corpus methods and corpus use. Lexicography required very large corpora, so there was evidence even for rare words and phrases. At the time—pre-web or in the web’s infancy, pre “big data”—few others in linguistics or the language professions saw any great need for corpora. The English learners’ dictionaries had a vast and growing market, and were highly profitable, and were competing intensively with each other to produce ‘the best’ dictionary. This was fertile ground for innovation.

Lexicography has continued to be a core use for the Sketch Engine, with four of the five main dictionary publishers in the UK (Cambridge University Press, Harper Collins, Macmillan, Oxford University Press) using it intensively. At CUP and Macmillan, this is just for English; at Collins also for the main European languages, and at OUP also for large bilingual-dictionary projects for Arabic, Chinese and Portuguese.

In the UK, dictionary publishing is dominated by companies (and the commercial wings of University Presses); this is possible largely because there is a very large market. In many countries and for many languages, the curation of the national language is seen as a national project, and most lexicography takes place in academies and national institutes. They form a second group of users for the Sketch Engine. The Sketch Engine is in use at national institutes for Bulgarian, Czech, Dutch,Footnote 9 Estonian, Irish,Footnote 10 and Slovak.

3.2 Universities

The Sketch Engine has come out of the academic research world, and, naturally, many of its users are in universities. Within universities, the main kinds of use are

  • in linguistics and languages departments: teaching and research

  • in computing departments: teaching and research in relation to language technology (also called Natural Language Processing, Computational Linguistics). This is the home area of all Sketch Engine team members

  • teaching translation

  • discourse analysis: analyses of a particular kind of language for what it tells us about the attitudes, power relations and perspectives of the participants. This kind of work takes place in a range of departments in the humanities and social sciences. Recent examples include the analysis of British newspaper discourse on migrants and migration; portrayal of science in the news; knowledge dissemination through personal blogs.

3.3 Language teaching

The Sketch Engine is widely used for English Language Teaching and occasionally also for the teaching of other languages including Chinese, Japanese and Arabic.Footnote 11 The ‘Teaching and Language Corpora’ community has been exploring ways of bringing corpus methods into language-teaching practice since Tim Johns’ work in the 1980s. Johns worked in Birmingham, UK, alongside the COBUILD project for using corpora for lexicography, and the uses of corpora for ELT can be seen as having two parts: indirect use, in the preparation of dictionaries (and coursebooks), as covered above, and direct: in the classroom.

A first ELT coursebook based on the Sketch Engine has recently been published (Thomas 2014).

Countries where the Sketch Engine is widely used in ELT include China, the Czech Republic, Germany, Italy, Japan, Spain and Taiwan as well as the UK.

3.4 Translators

Translators find corpora (of specific domains) useful for identifying the terminology and phraseology of the domain, in the language they are translating into. (They will usually be a native speaker of that language, but will often not know the terms and turns of phrase for specialised areas in which they have a translation task.) A number of professional translators are Sketch Engine users.

3.5 Terminologists

In the context of large organisations needing to prepare many documents in multiple languages, consistency is a challenge: in particular, the consistent use of the same term (within each language) for the same concept. It is good practice to develop and maintain a terminology, in which there is an entry for each of the concepts in a domain, with a specification of the term to be used in each language. One of the challenges for terminologists is finding the concepts and terms. The Sketch Engine can be used for term-finding (Kilgarriff 2013). This functionality has been developed in collaboration with the World Intellectual Property Organisation.

3.6 Language technology companies

A word list (with frequencies) for a language is a central resource for almost any language technology application, from speech recognition to spelling correction to text prediction. The corpora in the Sketch Engine provide the raw material, and the software can produce the word lists (and also many other lists: of n-grams, keywords, lemmas, terms) for many languages. Several technology companies have been users of this kind.

4 Languages

The Sketch Engine aims to cover all the large languages of the world, as well as any languages which particular users are asking for.

By a ‘large language’ we mean a language with a large number of speakers. The Ethnologue website provides a list of languages sorted by numbers of speakers, as shown in Table 1.

Table 1 All the world languages with over 50 million speakers

The Sketch Engine has high-level resources for fifteen of these languages (as well as for many smaller ones) and basic resources for a further four. The languages not covered are Javanese (where there is a complex relationship to Bahasa Indonesia, a variety of Malay, for which there is a basic resource) and four of the languages of India and Pakistan (Lahnda/Punjabi, Marathi, Oriya and Urdu).Footnote 12

The prerequisite for a basic resource for a language is simply a corpus (plus a segmentation tool where there are no spaces between words). A corpus can be collected from the web, using the Corpus Factory (Kilgarriff et al. 2010) or TenTen (Jakubíček et al. 2013) method.

For a high-level resource, further prerequisites are

  • a tokeniser (for Chinese and Japanese, usually called a segmenter) to identify the words. In simple cases this might just use spaces between words but many languages have clitics and similar needing language-specific treatment. English is a very simple language in this regard but even there, the hyphen and apostrophe characters, and mixtures of letters and non-letters, present challenges.

  • a lemmatiserFootnote 13

  • a part-of-speech tagger

  • a parser or ‘sketch grammar’.
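
The tokenisation challenges mentioned in the first item above (hyphens, apostrophes, mixtures of letters and non-letters) can be seen in a small sketch. The regular expression and the decision to keep hyphenated and apostrophe forms as single tokens are assumptions made for illustration; they are not the rules of any tokeniser used in the Sketch Engine.

```python
import re
from typing import List

# One possible policy: keep word-internal hyphens and apostrophes inside
# the token, and split off every other punctuation character.
TOKEN_RE = re.compile(r"""
    \w+(?:[-'’]\w+)*   # word, optionally continued after a hyphen or apostrophe
  | [^\w\s]            # any single non-word, non-space character
""", re.VERBOSE)


def tokenise(text: str) -> List[str]:
    """Split text into tokens according to the illustrative policy above."""
    return TOKEN_RE.findall(text)


text = "Don't re-use the well-known COVID-19 data, OK?"

print(text.split())
# ["Don't", 're-use', 'the', 'well-known', 'COVID-19', 'data,', 'OK?']
print(tokenise(text))
# ["Don't", 're-use', 'the', 'well-known', 'COVID-19', 'data', ',', 'OK', '?']
```

Splitting on spaces alone leaves punctuation attached (data, and OK?), while the regular expression separates it. Whether Don't should be one token or two, and whether well-known is one word, are exactly the language-specific decisions a real tokeniser has to make.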

What is also required is a collaborator. This is a person who speaks the language, and is ideally a computational linguist, and who cares about the quality of the output. They might care because they want to use the corpus in their own (or their group’s) work, or because they developed some of the tools and this is an opportunity to thoroughly test them and to show them (via data processed by them) to the world. The collaboration is crucial: without input from people who speak the language, the Sketch Engine team does not know if what it has done for a language is good. A collaborator is needed to point out mistakes and problems, which can then be addressed.

In the following sections we provide details about the status of Sketch Engine integration of various Asian languages.

4.1 Chinese

The collaboration for Chinese began with Prof Huang Chu-Ren inviting the first author to Taiwan in 2004. (At the time, Huang was Deputy Director of the Linguistics Department at Academia Sinica, Taiwan.) Following that visit, and commercial interest in Chinese in the Sketch Engine from CJKI,Footnote 14 the Chinese Gigaword corpus was acquired from the Linguistic Data Consortium, segmented and part-of-speech-tagged at Academia Sinica using the tools developed there, and installed into the Sketch Engine. A sketch grammar was developed and word sketches were made available (Huang et al. 2005; Kilgarriff et al. 2005), as illustrated in Fig. 11. They have supported an extensive research programme since (e.g. Chung and Huang 2010; Huang et al. 2014 forthcoming).Footnote 15

Fig. 11 Word sketch for Chinese (attack)

4.2 Arabic

The collaboration for Arabic is more recent, with the Centre for Computational Linguistics at Columbia University, USA (who prepared MADA + TOKAN, the leading tools for tokenisation, lemmatisation and POS-tagging of Arabic) and Arabic experts elsewhere in the USA, in Saudi Arabia and in the UK. Over a number of years we had received many expressions of interest regarding Arabic in the Sketch Engine. But the language presents a number of challenges:

  • there is modern standard Arabic (MSA: the language of the press, education, and officialdom, throughout the Arabic world), Classical Arabic, and the dialects. Most Arabic speakers speak largely their own dialect and are only occasional users of MSA. It is far from obvious what should be included in a corpus;

  • Arabic has many clitics, making tokenisation challenging;

  • Arabic is usually written without vowels;

  • Arabic has a complex morphological system, with a large share of the vocabulary being the result of derivations according to semi-productive processes. A central issue in Arabic lexicography is whether entries should be based on stems (the traditional approach, giving a smaller number of longer entries) or lemmas (which are closer to dictionary headwords in an English or French dictionary).

It has taken several years to assemble all the pieces required for high-quality resources for Arabic (Arts et al. 2014 forthcoming). An Arabic word sketch is shown in Fig. 12.

Fig. 12 Word sketch for Arabic (green)

4.3 Other Asian languages

For Turkish there were open-source tools available, including a parser, so a Turkish web corpus was processed with that, and the dependency relations which were the output of the parser were used directly to form word sketches (Ambati et al. 2012), see Fig. 13.

Fig. 13 Word sketch for Turkish yürek (heart)

For word sketches for Japanese (Srdanovic et al. 2008), Vietnamese (Hà et al. 2012), and Hindi, see Figs. 14, 15 and 16.

Fig. 14 Word sketches for Japanese (accumulate)

Fig. 15 Word sketches for Vietnamese tay (hand)

Fig. 16 Word sketches for Hindi (heart)

For Persian, a very large corpus which had been prepared and parsed at Carnegie Mellon University, USA, was loaded into the Sketch Engine.Footnote 16

There are also resources for Azerbaijani, Bengali, Hebrew, Indonesian, Kazakh, Korean, Kyrgyz, Malay, Malayalam, Tajik, Tamil, Telugu, Thai, Tibetan (Garrett et al. 2014 forthcoming), Turkmen, and Uzbek (Baisa and Suchomel 2012).

5 Corpora

“Corpora for all” is the Sketch Engine company tagline: here we give a brief survey of the range of corpora in the Sketch Engine.

Corpora in the Sketch Engine are either owned and managed by the Sketch Engine Team (‘preloaded’ corpora), or are user corpora, owned and managed by the user.

5.1 Preloaded corpora

5.1.1 General language

The primary goal is to provide, for each language, a large, recent, general language corpus for the language, processed with high-quality tools for the language, with word sketches, and checked extensively by one or more collaborators. These corpora are for lexicography and general language research, for example into the syntax or morphology of the language. ‘Large’ means at least 50 million words, and for recent work with large languages, several billion. In most cases these are web corpora, as the web is the only place to get material in vast quantity and covering a wide range of text types and domains. In some cases, for example Estonian or Irish, where there is collaboration with an organisation which has gathered a large corpus using some other method, we have combined web-sourced and other material.

These corpora can be kept up-to-date by crawling again and adding new material.

There are large, general-language corpora for sixty languages.

5.1.2 Parallel

One central language task is translation. For that, a key resource is the parallel corpus, comprising sets of texts which are translations of each other (or are both translations of the same source). Parallel concordancing, as in Fig. 17, is where a user inputs a search term in one language and sees pairs of sentences: those with the matching term in the first language, and the corresponding sentence in the target language.

Fig. 17 Parallel concordance, Chinese and English, and smile

In the Sketch Engine there are data for 300 language pairs. These data are from two main sources: EUROPARL and OPUS. EUROPARL comprises speeches made at the European parliament, which have been translated into 21 official European Union languages (Koehn 2005).Footnote 17 The OPUS data, a collection of parallel corpora collected in the OPUS project, are made available on its website.Footnote 18 They comprise many different parts, two of the largest (for most language pairs) being documents from the United Nations, and Open Subtitles.Footnote 19 Figure 17 shows subtitle data.

In addition to all of the functionality shown so far, an extra option for parallel data is to search simultaneously in both languages: Fig. 17 shows the output when is searched on the Chinese side, smile on the English.
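
Parallel concordancing, including the option of searching both sides at once, can be sketched over a list of aligned sentence pairs. The function name, the English-French toy pairs, and the plain word-form matching are assumptions made for the example; a real parallel corpus would also be tokenised and lemmatised.

```python
from typing import List, Optional, Tuple

SentencePair = Tuple[str, str]  # (source sentence, aligned target sentence)


def parallel_concordance(pairs: List[SentencePair], source_term: str,
                         target_term: Optional[str] = None) -> List[SentencePair]:
    """Return aligned pairs whose source side contains `source_term`.

    If `target_term` is given, the target side must contain it as well,
    which corresponds to searching in both languages simultaneously.
    """
    hits = []
    for source, target in pairs:
        if source_term.lower() not in source.lower().split():
            continue
        if target_term is not None and target_term.lower() not in target.lower().split():
            continue
        hits.append((source, target))
    return hits


# Invented aligned pairs for illustration.
pairs = [
    ("she gave a smile", "elle a fait un sourire"),
    ("he began to smile", "il se mit à sourire"),
    ("they smile often", "ils sourient souvent"),
    ("the weather is fine", "il fait beau"),
]

print(len(parallel_concordance(pairs, "smile")))             # 3
print(len(parallel_concordance(pairs, "smile", "sourire")))  # 2
```

The second search drops the pair containing sourient, because this sketch matches word forms only; matching on lemmas would bring it back.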

5.1.3 Second/foreign language learning and teaching

In the context of language learning, two central questions arise:

  • what are learners saying and writing?

  • what should they be saying and writing?

For the first, there are learner corpora.Footnote 20 Learner corpora are valuable for finding out what learners, at various levels, do, and for research into the process of language learning as well as the practicalities of curricula, course development, and testing. In the Sketch Engine there are learner corpora for Slovene, Czech and English.Footnote 21

For the second, the general answer is “the language”, and general language corpora meet that need. But there is also a more specific answer: one large population of language learners are learning English, and would like to study at an English-medium university. Thus their target is the English that is spoken in seminars and written in University-level essays, by accomplished English speakers. The British Academic Spoken English (BASE) and British Academic Written English (BAWE) corpora have been created as samples of these target varieties.Footnote 22

5.1.4 Historical

A central topic for linguists is language development and change. Corpora looking back over the history of a language, and supporting this kind of research, include LatinISE (of Latin from the third century B.C. to the twentieth century A.D.; McGillivray and Kilgarriff 2013), GermanC (of German from the seventeenth and eighteenth centuries; Scheible et al. 2011) and English Dialogues Corpus (sixteenth–eighteenth centuries; Culpeper and Kytö 2010).

For the Arabic world and Islam, the language has a special role. It is the language of the Quran and of the culture that the region shares. The different countries each have their own dialect, and the lingua franca, MSA, is closer to classical Arabic than to the dialects. The King Saud University Corpus of Classical Arabic (KSUCCAFootnote 23) brings together many of the central texts of this language, culture and religion, including the Quran and the Hadith.

5.1.5 Learning to speak

Since 1984, the CHILDES and Talkbank projects, based at Carnegie Mellon University, have been gathering child–adult conversations.Footnote 24 They are largely between babies and young children and their carers (with many of the carers being linguists, who have taken on the recording and transcription of the data). All are available as transcripts, and many also as audio or video. The data can be explored on the Talkbank website as well as the Sketch Engine: the two websites are complementary, with Talkbank expecting the user to be a developmental or general linguist, and the Sketch Engine expecting them to have a corpus orientation. There is a CHILDES corpus in the Sketch Engine for 22 languages, varying in size from a few thousand words to, for English, 23 million.

5.1.6 Learning to read and write

Educators, children’s authors and publishers, and linguists and psychologists studying the process of learning to read, are interested in the language that schoolchildren read and write. So are producers of children’s dictionaries. The Education division of Oxford University Press has created the Oxford Children’s Corpus (Wild et al. 2013), comprising both material written for children (largely stories, many being titles published by OUP) and stories written by children. This second part resulted from a competition led by the top UK disc jockey Chris Evans, who, from his show on BBC Radio 2, invited children to write a 500-word story and send it in to him. In 2014, 115,000 British children did so. The BBC then made the data available to OUP for linguistic research.Footnote 25

The size of the corpus, as at April 2014, is 115 million words.

5.1.7 Reference corpora

The Brown corpus was central to the development of corpus linguistics. It was one million words, comprising five hundred 2,000-word samples from 13 different genres, all of American English published in 1961. It has played a huge role as a point of reference ever since and has spawned ‘Brown family’ corpora for British and American English, at a number of time points (see Fig. 8).

Another key reference corpus for English is the British National Corpus (Burnard 1995).

5.1.8 Sociolinguistics

Sociolinguists are interested in how language varies between social groups, across age groups, with movements of populations, and between communities. A corpus designed to study these topics is the London English corpus (Kerswill et al. 2013).

5.2 User corpora

As well as preloaded corpora (managed by the Sketch Engine team) users can upload, build, process, share and explore their own corpora.

Where a user already has a corpus, they can upload it and install it, via a simple web interface. The source documents can be in any of the common formats (doc, html, pdf, txt, tmx) and may also be compressed and/or archived (.zip, .gz, .bz2, .tar). All of these formats are then converted to plain text (.txt). If the data are already annotated (perhaps with part-of-speech tags, or lemmas, or for discourse function, etc.) then it needs to be in the Sketch Engine’s input format, ‘vertical’ text, as documented in the Sketch Engine help pages. The user can then manage their own corpora, including adding more data, deleting, and processing (see below) as well as using them for their research via the core Sketch Engine functions (as in Sect. 2).
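
As a rough illustration of what annotated input in a ‘vertical’ layout looks like, the sketch below writes one token per line with tab-separated annotations, and marks documents and sentences with XML-like tags. The column order (word, tag, lemma), the tag labels, and the element names `doc` and `s` are assumptions for the example; the Sketch Engine help pages remain the authority on the actual format.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str, str]  # (word form, part-of-speech tag, lemma)


def to_vertical(doc_id: str, sentences: List[List[TaggedToken]]) -> str:
    """Serialise tagged sentences as one-token-per-line text.

    Each token line holds tab-separated annotations; structure is marked
    with XML-like tags on lines of their own.
    """
    lines = [f'<doc id="{doc_id}">']
    for sentence in sentences:
        lines.append("<s>")
        for word, tag, lemma in sentence:
            lines.append(f"{word}\t{tag}\t{lemma}")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines)


# Hand-annotated toy sentence; the tag labels are hypothetical.
sentence = [("Things", "NNS", "thing"), ("catch", "VVP", "catch"), ("fire", "NN", "fire")]

print(to_vertical("d1", [sentence]))
```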

BootCaT (Baroni and Bernardini 2004) is a procedure for building a corpus, starting from a set of ‘seed words’, by making tuples (typically triples) of the seed words, sending each tuple as a query to a search engine, and then gathering the web pages that the search engine finds. When applied to a specialist domain, with seed words from that specialist domain, it turns out to be a remarkably efficient way of discovering the terminology and phraseology of that domain. The Sketch Engine includes an implementation called WebBootCaT. See Fig. 18 for the WebBootCaT form and Fig. 19 for the keywords and terms found, fully automatically and in a few minutes, for the vulcanology domain.

Fig. 18 WebBootCaT form

Fig. 19 Terminology in the volcanoes domain, as extracted from BootCaTted Vulcanology corpus. Items in green were input seeds. Number (in blue) can be clicked to see concordances
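
The first step of the BootCaT procedure, turning seed words into tuples that serve as search-engine queries, can be sketched as below. The function name, the seed list, and the fixed random seed are assumptions for illustration; sending the queries to a search engine and downloading the pages is left out, since it depends on an external service.

```python
import itertools
import random
from typing import List


def make_queries(seeds: List[str], tuple_size: int = 3,
                 n_queries: int = 5, rng_seed: int = 0) -> List[str]:
    """Build up to `n_queries` distinct queries, each a tuple of seed words."""
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the sample is reproducible
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


# Hypothetical seed words for a volcanology corpus.
seeds = ["volcano", "lava", "magma", "eruption", "crater"]

for query in make_queries(seeds):
    print(query)
```

Five seeds give ten possible triples, from which this sketch samples five. Each printed line would be sent as one query, and the pages returned would form the raw corpus.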

More can be done with a user corpus if it is accurately tokenised, lemmatised and part-of-speech tagged. Tools for these processes are language-specific. For 11 major world languages, tools have been identified, licensed (where necessary) and installed. Some users have used the Sketch Engine explicitly for this service: they can upload plain text, get it processed, and then download the processed data (in vertical format), perhaps for further annotation and re-uploading.

Processing options are steadily being added for more languages.

There is ‘access control’ for user corpora. By default, the only user who can see a corpus is the person who created it, but they can give access to others (at a web interface). This allows a teacher to give access to a corpus that they have prepared, to their students, or a researcher to share their corpus with their colleagues.

6 New functionality

At ten years old, the Sketch Engine is mature software. There has been a steady stream of new functionality as well as bug-fixes and improvements to usability. Recent usability improvements include

  • a ‘breadcrumb trail’ to show a user how they got to the concordance they are looking at, which might be the results of an original search plus sorting, filtering and sampling. This was a response to user feedback that it was easy to lose track of what a particular concordance was

  • ‘more data’ and ‘less data’ buttons for word sketches. The number of collocates shown in a word sketch is defined by three parameters: a frequency threshold, a salience threshold and a limit to the number of collocates per list. Users sometimes want to see more collocates than they have been shown in the first instance, and sometimes they feel overwhelmed and want to see fewer. But if they start considering the three parameters, they are bamboozled. Hence the ‘more data’ and ‘less data’ buttons

  • thesaurus word clouds: see Fig. 9.
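The interplay of the three word sketch parameters described in the ‘more data’ and ‘less data’ item above can be sketched as follows. This is a minimal illustration, assuming each candidate collocate comes with a frequency and a salience score. The names `SketchParams`, `select_collocates`, `more_data` and `less_data`, the default values and the step sizes are all hypothetical and are not the Sketch Engine's actual settings.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Collocate = Tuple[str, int, float]  # (collocate, frequency, salience score)


@dataclass(frozen=True)
class SketchParams:
    min_freq: int = 5          # frequency threshold
    min_salience: float = 2.0  # salience threshold
    max_items: int = 3         # limit on collocates per list


def select_collocates(candidates: List[Collocate], p: SketchParams) -> List[Collocate]:
    """Apply both thresholds, then keep the most salient items up to the limit."""
    kept = [c for c in candidates if c[1] >= p.min_freq and c[2] >= p.min_salience]
    kept.sort(key=lambda c: c[2], reverse=True)
    return kept[: p.max_items]


def more_data(p: SketchParams) -> SketchParams:
    """Loosen all three parameters at once (step sizes are arbitrary)."""
    return replace(p, min_freq=max(1, p.min_freq // 2),
                   min_salience=p.min_salience / 2,
                   max_items=p.max_items * 2)


def less_data(p: SketchParams) -> SketchParams:
    """Tighten all three parameters at once (step sizes are arbitrary)."""
    return replace(p, min_freq=p.min_freq * 2,
                   min_salience=p.min_salience * 2,
                   max_items=max(1, p.max_items // 2))


# Made-up candidates for the headword "volcano".
candidates = [("erupt", 40, 9.1), ("active", 25, 8.3), ("dormant", 12, 7.9),
              ("ash", 6, 5.0), ("big", 50, 1.5), ("Etna", 3, 6.0)]
params = SketchParams()
print(select_collocates(candidates, params))
print(select_collocates(candidates, more_data(params)))
print(select_collocates(candidates, less_data(params)))
```

The point of the design is that the user presses one button and never has to reason about the three parameters individually.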

Other additions to functionality, which complement the core functions described in Sect. 2 and the preloaded and user corpora as described in Sect. 5, include the API, GDEX, bilingual sketches, keywords and ‘comparing corpora’, terminology, and localisation.

6.1 API

A simple JSON API allows other programs to access word sketches, collocations and thesaurus entries, and to find the terminology in a document.

6.2 GDEX

Dictionary users like examples. This is a clear finding of dictionary user research (Frankenberg Garcia 2014). Where the dictionary is to be published on paper, not many examples can be offered owing to space limitations. With electronic dictionaries, that constraint disappears. The constraint becomes, rather, the editorial time needed to prepare them. There are already compelling linguistic reasons for taking examples from corpora rather than inventing them (Hanks 2012): could the corpus software not merely find the examples for a word, but automatically find the good ones, for using as dictionary examples?

A Good Dictionary Examples (GDEX) function was added to the Sketch Engine in 2008 and has had many enthusiastic users. It was originally applied to English (Kilgarriff et al. 2008) and has since been used for a number of languages including Slovene (Kosem et al. 2011). It works by sorting a concordance, so the corpus lines judged best by the algorithm are shown first. Then the lexicographer should not have to read many of them before finding a good one. The same core technique has also been used to score documents and to exclude low-scoring documents from a corpus entirely.
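The sorting step can be illustrated with a toy scorer. This is a sketch under stated assumptions, not the GDEX algorithm itself: the two heuristics used here (a preferred sentence-length range and a preference for sentences made of common words), the function names and the tiny word list are all chosen for illustration only.

```python
import re
from typing import Iterable, List, Set


def example_score(sentence: str, common_words: Set[str],
                  min_len: int = 5, max_len: int = 20) -> float:
    """Toy score in [0, 1]: share of common words, halved if length is out of range."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    if not words:
        return 0.0
    score = sum(w in common_words for w in words) / len(words)
    if not (min_len <= len(words) <= max_len):
        score *= 0.5
    return score


def sort_concordance(lines: Iterable[str], common_words: Set[str]) -> List[str]:
    """Return concordance lines with the highest-scoring candidates first."""
    return sorted(lines, key=lambda s: example_score(s, common_words), reverse=True)


# Hypothetical common-word list and concordance lines for "volcano".
common = {"the", "volcano", "erupted", "in", "night", "and",
          "was", "seen", "from", "town"}
lines = [
    "Stratovolcano tephra geochemistry.",
    "The volcano erupted in the night and was seen from the town.",
    "The volcano erupted.",
]
for line in sort_concordance(lines, common):
    print(line)
```

Because the output is a ranking rather than a yes/no decision, the lexicographer stays in control: the function only changes which lines are read first.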

The critical outstanding question, for dictionary publishers, is this: can GDEX work well enough that example sentences can be added to dictionary entries without an editor needing to check them first? This is a goal, but a number of obstacles stand in the way. First, the corpus needs to be very big, to provide plenty of examples for the algorithm to choose amongst, and usually the only way to get a very large corpus is from the web. But web corpora contain web spam, which sometimes makes it past all other filters and makes bad dictionary examples.

Second: parsnips. Parsnips is an acronym for the potentially offensive topics that teaching materials, which will be seen across the globe by all communities and cultures, might be wise to avoid. It stands for Politics, Alcohol, Religion, Sex, Narcotics, Isms, Pork (as a stand-in for various foods which are taboo in various cultures). The second current challenge, then, is to scrub the data clean of parsnips.
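One simple way to approach the scrubbing task is a word-list filter, sketched below. This is an assumption for illustration: the text does not say how the scrubbing is done, the trigger lists are tiny hypothetical placeholders covering only some of the parsnips topics, and a usable list would have to be compiled and maintained by editors.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical, deliberately tiny trigger lists keyed by topic.
PARSNIPS_TERMS: Dict[str, Set[str]] = {
    "politics": {"election", "parliament"},
    "alcohol": {"beer", "whisky"},
    "religion": {"church", "mosque"},
    "narcotics": {"cocaine", "heroin"},
    "pork": {"pork", "bacon"},
}


def flagged_topics(sentence: str) -> List[str]:
    """Return the topics whose trigger words occur in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(topic for topic, terms in PARSNIPS_TERMS.items() if words & terms)


def scrub(sentences: Iterable[str]) -> List[str]:
    """Keep only candidate examples that trigger no topic."""
    return [s for s in sentences if not flagged_topics(s)]


candidates = [
    "He drank a beer after the election.",
    "The volcano has been dormant for a century.",
]
print(flagged_topics(candidates[0]))
print(scrub(candidates))
```

A filter of this kind errs on the side of caution: it will also discard harmless sentences that happen to contain a trigger word, which is acceptable only if the corpus is large enough to leave plenty of candidates.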

6.3 Bilingual sketches

Where monolingual lexicographers appreciate monolingual word sketches, bilingual lexicographers would like bilingual ones. They have recently been developed (Baisa et al. 2014) and are currently being rolled out for more language pairs.

6.4 Keywords and corpus comparison

Where there are a number of corpora available for a language, the question arises, “how do they compare?” This has been the central research question for the first author for some years (Kilgarriff 2001, 2012) and the Sketch Engine supports a range of comparisons, quantitative and qualitative, between any pair of same-language corpora: see Kilgarriff (2012) for English and Czech, Kilgarriff and Renau (2013) for Spanish.

6.5 Terminology

To find the terminology of a domain, in a language, the requirements are

  • a domain corpus

  • a reference corpus

  • a grammar for terms

  • a lemmatiser, part-of-speech tagger, and parser (to find linguistic units with the grammatical shape that makes it possible for them to be terms)

  • a statistic (to identify the term candidates that are most distinctive of the domain in contrast to the reference material).

The Sketch Engine has most of these pieces in place. Users can upload their domain corpus, or build one using WebBootCaT. Reference corpora are available for 60 languages. There are already grammars for the word sketches, which can be adapted to provide term grammars. The parsing machinery is in place, and, as discussed above, for a growing number of languages, language-specific processing tools are installed and ready to use. The statistic used to identify keywords is also suitable for identifying terms.
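To make the role of the statistic concrete, here is a minimal sketch of ranking term candidates by how distinctive they are of the domain corpus. It assumes a smoothed ratio of per-million frequencies as the statistic; this section does not state the formula, so the choice of statistic, the smoothing constant and all names and counts below are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def distinctiveness(dom_count: int, dom_size: int,
                    ref_count: int, ref_size: int,
                    smoothing: float = 1.0) -> float:
    """Smoothed ratio of domain to reference per-million frequency."""
    return ((per_million(dom_count, dom_size) + smoothing)
            / (per_million(ref_count, ref_size) + smoothing))


def rank_terms(domain_counts: Dict[str, int], domain_size: int,
               ref_counts: Dict[str, int], ref_size: int,
               smoothing: float = 1.0) -> List[Tuple[str, float]]:
    """Score every candidate from the domain corpus, most distinctive first."""
    scored = [(term, distinctiveness(count, domain_size,
                                     ref_counts.get(term, 0), ref_size, smoothing))
              for term, count in domain_counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up counts: a 1M-token domain corpus and a 100M-token reference corpus.
domain_counts = {"lava flow": 200, "last year": 150, "pyroclastic flow": 80}
ref_counts = {"lava flow": 100, "last year": 20_000}
for term, score in rank_terms(domain_counts, 1_000_000, ref_counts, 100_000_000):
    print(f"{term}\t{score:.2f}")
```

The smoothing constant keeps the ratio defined for candidates that never occur in the reference corpus, and a larger value shifts the ranking towards more frequent candidates.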

Term-finding functionality is now available in the Sketch Engine, as illustrated in Fig. 19, for ten languages.

6.6 Localisation

Machinery to support the localisation of the interface has been added. Currently, the interface can be seen in Czech, Chinese, English, Irish, Slovene, and Croatian.

7 Related work

The Sketch Engine is both a corpus query tool and a web service; the web service includes corpus building and management. We take each in turn.

7.1 Corpus query tools

Software tools for corpus exploration fall into two categories: those designed for installation on each computer where they are used, and those designed for installation on a server.

Widely used tools for local installation include (in order of their invention) Monoconc/Paraconc (since 1995), WordSmith (since 1996), Antconc (Anthony 2004; since 2004) and Concgram (Greaves 2009; since 2005). Antconc is free, and the other three are commercial products. All have many enthusiastic users. All can be used over a network, but this is not their normal mode.

Amongst tools for use over a network, the IMS Corpus Workbench has pride of place. IMS is the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart, where the tool was developed in the early 1990s (Christ and Schulze 1994). (It is also often referred to as CWB, Stuttgart tools, or CQP, for its “Corpus Query Processor”.) It has been widely used and has a community of developers working with it. The original version was pre-web, and the envisaged network was within a university. A central question since then has been how it can be made to work well on the web. The usual solution has been for it to provide the back end, with a number of front ends built on top of it. CQPWeb (Hardie 2012), for example, combines the IMS Corpus Workbench back end with a MySQL database.

The Stuttgart tools and CQPWeb are both free and open-source, and the community of developers for corpus software has a strong commitment to open source. While the Sketch Engine is not open source, as this could undermine its viability as a business, a version of it, NoSketchEngine, is open source.Footnote 26

The functionality of all of these tools (and most of those covered below) comprises a concordancer, plus various ways to manipulate concordances, plus a range of summary reports. There is little disagreement about the value of the various reports, and the functionality differences lie rather in how much time and motivation the developers have had to develop more functions. As one of the more mature tools, working commercially with a support and development team of seven, the Sketch Engine has more functions than most.

7.2 Corpus websites and services

There are a small number of corpus websites for multiple languages and a large number for a single language (and usually, a single corpus). We review those that cover multiple languages in some detail below. We do not cover the single-language ones: there are too many, and many of them are short-lived. It is de rigueur for any corpus project to make its corpus available over the web, and this is typically done on a dedicated website, sometimes using the Stuttgart tools as the back end and sometimes using software developed as part of the project. Such projects are often national projects, and one advantage is often that the interface is in the same language that the corpus is a corpus of. For scholars of that language who may not be at ease in English, this may be a major advantage. A disadvantage of developing software afresh is that the software will be new: it is likely to be less robust, with less functionality, than mature systems. The funding will end and then it will be hard to maintain. Growing numbers of corpus developers are taking the route of making their corpus available in the Sketch Engine.Footnote 27

7.2.1 Corpus websites for multiple languages

Mark Davies’s website at Brigham Young UniversityFootnote 28 offers corpora for English, Spanish and Portuguese. The resources for English are outstanding, supporting the exploration of the behaviour of words and phrases across time, genre, and regional varieties (Davies 2009). The system is fast and reliable.

Uwe Quasthoff and colleagues in Leipzig have crawled the web for corpora of 229 languages and made them searchable at their Wortschatz website (Quasthoff et al. 2006).Footnote 29 The website is in German. Within Germany, this is a very widely used site: it serves as a main reference for language questions from laypeople.

Eckhard Bick has focussed on syntax and parsing. The Visual Interactive Syntax Learning websiteFootnote 30 has corpora which are often modest in size but are parsed. The website has games and quizzes to support language education, as well as Deepdict, comprising word-sketch-like reports, for nine languages (Bick 2009).

The OPUS project (Tiedemann and Nygaard 2004) has gathered parallel corpora and organised them so they are both searchable on the website (with the Stuttgart tools as the back end) and also downloadable, for use in other research and software (including the Sketch Engine; many of the parallel corpora in the Sketch Engine were taken from the OPUS site).Footnote 31

At the University of Leeds in the UK, Serge Sharoff makes web corpora for 13 languages available to all for searching, again using the Stuttgart tools back end (Sharoff 2006).Footnote 32

All of these sites are free to use. This is in contrast to the Sketch Engine. Most are based in Universities and are supported via research grants and academic salaries. While, naturally, most people would rather not pay (other than via their taxes), the commercial model has advantages. There is an income stream to support the maintenance and development of the software and web service for the long term, and customers with particular requirements can get what they need, by paying for it.

Google and other search engines do a similar job to a corpus website: they allow the user to find many instances of a word, in context, as a dataset for further study, and they do it fast. Where the user knows of no corpus for the language, or the item they are searching for is rare so not enough data are available via dedicated corpus linguistic tools, Google may be the best tool to use. For a discussion of the use of search engines for corpus research see Kilgarriff (2007).

A possibility lying between the search engine and a corpus tool is the metasearch engine, in which a corpus tool takes a user’s query, passes it on to Google or another search engine, receives the results, and filters and displays them in ways that are useful for language researchers. The best known tool of this kind is Webcorp (Renouf et al. 2006).Footnote 33

Other corpus-like websites, mentioned here for completeness, are:

  • Wikipedia: the Wikipedia for a language is a convenient corpus for that language, as used, for example, as a starter corpus in Kilgarriff et al. (2010).

  • Project GutenbergFootnote 34

  • Google booksFootnote 35

  • the Linguistic Data ConsortiumFootnote 36 and European Language Resource Association,Footnote 37 for catalogues of available resources, including corpora of various kinds for many languages

  • Linguee,Footnote 38 and Webitext,Footnote 39 parallel concordancers offered as a service to translators.

7.3 Tools for corpus building and annotation

The BootCaT procedure is described above. There are a number of implementations, including one from the University of Bologna group where the idea was originally developed.Footnote 40

Several groups have developed pipelines for web corpus building. The steps are

  • web crawling

  • removing duplicates

  • ‘cleaning’ to remove non-text material

  • language identification

  • linguistic processing (tokenisation, possibly also lemmatisation, part-of-speech tagging, parsing).

The pipeline used by the Sketch Engine team uses three tools which were developed within the group and have now been published as open-source software: spiderling (Suchomel and Pomikálek 2012) for crawling, onion for deduplication, and justext for cleaning (both Pomikálek 2011). Other pipelines, with similar philosophy and components, have been developed in Bologna (Baroni et al. 2009), Leipzig (Biemann et al. 2004) and Berlin (Schäfer and Bildhauer 2013).
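How the steps after crawling fit together can be shown with a toy pipeline. The functions below are crude stand-ins written for illustration only and are not the algorithms of spiderling, onion or justext: deduplication here catches only exact duplicates, cleaning is a length heuristic, language identification is a stopword-share check, and linguistic processing stops at tokenisation. All names, thresholds and the stopword list are assumptions.

```python
import hashlib
import re
from typing import Iterable, Iterator, List, Set


def deduplicate(paragraphs: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates (after case and whitespace normalisation)."""
    seen: Set[str] = set()
    for p in paragraphs:
        key = hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield p


def clean(paragraphs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Crude 'cleaning': drop very short paragraphs such as menus and buttons."""
    for p in paragraphs:
        if len(p.split()) >= min_words:
            yield p.strip()


def looks_like(paragraph: str, stopwords: Set[str], min_share: float = 0.2) -> bool:
    """Crude language check: enough of the words are stopwords of the target language."""
    words = re.findall(r"\w+", paragraph.lower())
    return bool(words) and sum(w in stopwords for w in words) / len(words) >= min_share


def tokenise(paragraph: str) -> List[str]:
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", paragraph)


def build_corpus(paragraphs: Iterable[str], stopwords: Set[str]) -> List[List[str]]:
    """Chain the steps in the order listed: dedupe, clean, identify language, tokenise."""
    kept = (p for p in clean(deduplicate(paragraphs)) if looks_like(p, stopwords))
    return [tokenise(p) for p in kept]


EN_STOP = {"the", "of", "and", "a", "in", "is", "to"}
pages = [
    "Click here",
    "The volcano is in the south of the island.",
    "the volcano is in  the south of the island.",
    "Der Vulkan liegt im Süden der Insel und ist aktiv.",
]
print(build_corpus(pages, EN_STOP))
```

Writing each step as a generator over paragraphs means any one stage can be swapped for a stronger tool without touching the others.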

Annotating a corpus with human input (as distinct from a fully automatic process) is supported in a limited way in the Sketch Engine, via facilities developed for Hanks’s Corpus Pattern Analysis (Hanks 2008).Footnote 41 Many tools have been developed specifically for manual and semi-automatic corpus annotation, leading examples being the UAM tool (O’Donnell 2008) and the Groningen Meaning Bank tool (Basile et al. 2012).Footnote 42

8 Conclusion

The Sketch Engine is a leading corpus tool (both in the sense of ‘corpus query tool’ and in the sense of ‘corpus web service’). It is now 10 years old: a 10-year period that has seen revolutions in connectivity, devices, and dictionary publishing, and the worldwide spread of corpus methods in dictionary-making. It is mature software offering a wide range of functions, with the web service offering many corpora for many languages, as well as services for corpus building and maintenance.

In this paper we have described word sketches, concordancing, and the thesaurus (Sect. 2), the different kinds of user (Sect. 3), and approaches to working with many different languages (Sect. 4). Section 5 reviewed the kinds of corpora available in the Sketch Engine, including user corpora and the ways of building and working with them. Section 6 gave a brief tour of some of the innovations and new reports offered in the past few years. In Sect. 7 we reviewed related work.

As the strapline has it, ‘corpora for all!’