
1 Overview

1.1 Introduction

Language use and writing strategies are two inseparable facets of the same process: knowledge creation and sharing. In order to produce valuable pieces of writing, either creative or scientific in nature, writers of all ages and competence levels are challenged with tasks that range from simple word selection (Cameron & Dempsey, 2013) to adapting their writing style to formal or informal registers (Reppen et al., 2002). The first inventories of words were dictionaries, structured archives of the contextual uses of linguistic items present in the language at a certain point in time; they also served as the first linguistic research outcomes. In this context, the emergence of corpora seemed like a natural methodological evolution in language sampling and research. When lexicographers started collecting data for corpus based dictionaries (Teubert, 2007), their purpose was not only to disambiguate vocabulary terms and their meanings but also to provide lexical options based on authentic language samples (Hanks, 2009).

Being collections of naturally occurring samples of language, corpora represent reliable guidance resources for writers of all disciplines, genres and purposes. Besides offering instant access to authentic usage for individual users, students and researchers, corpora also facilitate the creation of digital tools that support language use in writing and the writing process in general:

People are not generally aware that computational linguists use corpora to develop all sorts of language tools that have become commonplace in our everyday lives, from simple spell checkers, to auto-correct options in word processors and web browsers, to sophisticated machine translation programs. (Frankenberg-Garcia, 2014)

Besides basic challenges such as the choice of words, a frequently encountered problem in the process of writing is writer’s block, a phenomenon which is intrinsically cognitive (Hodges, 2017) but which can often be overcome through linguistic support. Such support can be automatic in nature, like paragraph generation (Duval et al., 2021), or can assist the lexical refinement process (Baker-Brodersen, 1988). Prompts of this kind are often based on corpora, and they are readily available online, provided that the user is aware of the limitations of corpus queries (Kaltenböck & Mehlmauer-Larcher, 2005).

1.2 Evolution of Corpus Linguistics

Nowadays, corpora represent collections of texts that are collected, processed, analysed and exploited with the help of computer technology. But corpora have not always been digital and, as the name corpus implies, i.e. 'body' of language in Latin (Bondi, 2017, p. 46), they existed even before the advent of technology, when linguists used pre-computer corpora as a base for their linguistic studies (Biber & Reppen, 2015, p. 2). For example, when writing the Dictionary of the English Language, published in 1755, Samuel Johnson used around 150,000 natural sentences written on slips of paper to show the natural use of words (p. 2). Up to the 1960s, other noteworthy works include dictionaries, e.g. The Oxford English Dictionary published in 1928, empirical vocabulary studies, such as the General Service List (West, 1953), and grammar studies, such as the two American English corpus based grammars by C. C. Fries published in 1940 and 1952 (Biber & Reppen, 2015, pp. 2–3) (Fig. 1).

Fig. 1 History of corpus linguistics beginnings (timeline from the Dictionary of the English Language to the Frown corpus)

An important change occurred in the 1980s, when large electronic corpora became widely available and computational tools started to be used to perform linguistic analyses on them (Biber & Reppen, 2015, p. 3). This gave rise to a flurry of linguistic studies using electronic corpora, focusing on features ranging from lexis and grammar to register variation (pp. 3–4).

The two milestone corpora, Brown (American English) and LOB (the Lancaster-Oslo/Bergen Corpus of British English), have been paralleled by later versions, Frown (Freiburg-Brown Corpus of American English) and FLOB (Freiburg-LOB Corpus of British English), initiated by Christian Mair at the University of Freiburg in Germany in 1991. The linguistic data in the later versions were meant to reflect the development of the language from the time of the initial corpora (the 1960s) to that of the new ones (the 1990s).

Since then, continuous advances in technology have enabled the use of electronic corpora and corpus tools on a very large scale. Corpus linguistics is now a well-established discipline, and its data analysis methods contribute to investigating language from various perspectives, related to topics such as registers, dialects or entire languages (Egbert et al., 2020, p. 3).

1.3 Core Idea of the Technology

Corpora can be broadly defined as machine-readable collections of texts, compiled according to specific criteria, that are analysed with the help of computer software, as they are too large to undergo manual analysis (McEnery & Hardie, 2012, pp. 1–2). The way a corpus is built is of great importance, because a corpus should ideally “represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair, 2005).

Web-based corpora, mainly composed of web pages, are the largest corpora available, containing billions of words. For example, the filtered version of the Common Crawl used in the pre-training dataset for GPT-3 (Generative Pre-trained Transformer 3) consists of 410 billion tokens. This large quantity of data enables powerful quantitative analyses. In addition, it is now possible to apply corpus linguistics methods to less structured language repositories, such as text archives (e.g. Lexis-Nexis, Google Books), or even the entire web. Common search interfaces allow basic queries that can yield linguistically relevant results. More powerful, however, are the so-called corpus architectures, which enable the more complex queries usually found in corpus linguistics tools. Examples of corpus architectures include the googlebooks.byu.edu interface, which uses n-grams extracted from Google Books, and the web-based tool Sketch Engine, which, alongside a variety of other corpora, hosts several enormous web-based corpora that can be searched using all the tool’s features (Davies, 2015, pp. 19–22).
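To make the notion of an n-gram concrete, the following minimal Python sketch counts word bigrams in a plain-text file. The file name is hypothetical, and the snippet only illustrates the underlying data structure behind n-gram interfaces, not how any particular platform is actually implemented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over successive n-grams (tuples of n adjacent tokens)."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical plain-text corpus file
with open("corpus_sample.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
for gram, freq in bigram_counts.most_common(10):   # ten most frequent bigrams
    print(" ".join(gram), freq)
```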

While web-based corpora are very successful in representing the genres normally found on the web, e.g. newspaper articles, they cannot offer a comprehensive picture of other language varieties, such as fiction or spoken language. General-purpose genre-balanced corpora seem to be a good middle ground between size and representativeness. Corpora of this type contain sub-sections representative of several registers and are also considerably large, so that powerful statistical analyses are supported. Two famous genre-balanced corpora are the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). COCA, which currently contains more than one billion words, is representative of eight registers, including academic texts, speech and fiction. New data is continuously added to the corpus in a controlled manner, with much attention dedicated to preserving the genre balance in each subsection.

Even so, in certain cases, the language domain studied is composed of texts that are not found in general-purpose corpora. This therefore requires the use of a specialized corpus, a type of corpus that represents, as far as possible, the full range of linguistic variation in a specific variety of language (Clancy, 2010, p. 82). The representativeness of the corpus is more important than its size, since it has been shown that a well-designed small specialized corpus can provide more relevant results regarding “specialized lexis and structures” (O’Keeffe et al., 2007, p. 198) than a large corpus that was not customized to meet the researcher’s needs (Nesi, 2012, p. 408).

Yet, there are situations in which no ready-made corpus can meet the needs of a specific research question, and, in these cases, scholars need to compile new corpora. Sometimes called DIY corpora, these corpora are compiled in basic formats, e.g. txt files, and are smaller than ready-made specialized corpora, but because they contain only the language variety under investigation, their analysis yields valuable results (Nesi, 2012, p. 408). However, most of the time DIY corpora remain private due to copyright laws.
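As a minimal sketch of what working with such a DIY corpus can look like, assuming a private folder of plain-text (.txt) files representing the language variety under investigation, the following Python snippet compiles a first frequency profile of the collection. All file and folder names are hypothetical.

```python
from collections import Counter
from pathlib import Path
import re

corpus_dir = Path("my_diy_corpus")        # hypothetical folder of .txt files
word_freq = Counter()
n_texts = 0

for txt_file in sorted(corpus_dir.glob("*.txt")):
    text = txt_file.read_text(encoding="utf-8").lower()
    word_freq.update(re.findall(r"[a-z']+", text))   # crude tokenisation
    n_texts += 1

print(f"{n_texts} texts, {sum(word_freq.values())} running words")
print(word_freq.most_common(20))                     # a first frequency profile
```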

1.4 Processing and Tools

In order to apply corpus linguistics methods to a data set, several steps need to be taken. The corpus is first compiled, then the corpus data is annotated, and, finally, the corpus is analysed using corpus linguistics software (Rayson, 2015). Annotation is a procedure which “allows the researcher to encode linguistic information present in the corpus for later retrieval or extraction” (Rayson, 2015, p. 38). Certain types of annotation can be done automatically, while others are done manually. Automatic annotation with a high degree of accuracy has already been achieved for English (and other major languages) at the levels of “morphology (prefix and suffix), lexical (part-of-speech and lemma), syntax (parsing)” (Rayson, 2015, p. 39) and, in many cases, semantics (semantic field). However, one downside of automatic annotation is that it is not accurate enough for every language. Manual annotation, on the other hand, is used for areas not supported by automatic annotation, e.g. discourse (Rayson, 2015).
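As an illustration of what automatic annotation at these levels can look like in practice, the sketch below uses spaCy, one of several freely available NLP libraries (the chapter itself does not prescribe any particular tool), to tag a sentence with part-of-speech, lemma and dependency (parsing) information.

```python
# Requires: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The corpora were annotated automatically before analysis.")

for token in doc:
    # surface form, lemma, part of speech, and dependency (parse) information
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} "
          f"{token.dep_:10} head={token.head.text}")
```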

After having been compiled and annotated, the corpus can be searched using software tools for corpus analysis. These tools can be standalone software installed on one’s computer, e.g. WordSmith, AntConc or #LancsBox. One important goal of these tools is to be user-friendly; however, they still involve a learning curve, and this may discourage non-corpus linguists from using them.
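The central display of such tools is the concordance (keyword-in-context, KWIC) view. The following minimal Python sketch reproduces the idea on a hypothetical corpus file; it is only an approximation of what dedicated tools such as AntConc offer.

```python
def kwic(tokens, keyword, window=5):
    """Print every occurrence of keyword with `window` tokens of left/right context."""
    keyword = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

# Hypothetical corpus file
with open("corpus_sample.txt", encoding="utf-8") as f:
    kwic(f.read().split(), "however")
```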

1.5 Functional Specifications

As previously mentioned, the use of corpora means, first and foremost, access to authentic language samples. Such access can be gained unsystematically, via large search engines (e.g. Google), or in a more structured manner, through dedicated corpus search platforms (e.g. the COCA corpus platform). Nevertheless, users should consider the following shortcomings of both access routes: first, unsystematic databases contain linguistic information which is unfiltered, and neither properly verified nor structured; second, corpus platforms, while including linguistic information that has been collected according to specific representativeness criteria, are quite frequently not open source (i.e. licence-based).

Writing-focused tools are built and designed to integrate large amounts of linguistic information with the purpose of extracting statistically validated language patterns and offering context-specific solutions. For example, such instruments can perform instant searches in their built-in corpora, select multiple word associations and generate best-matching collocation lists. This can benefit writers who are uncertain about the grammatical construction of a linguistic cluster, about the lexical choice within a structure or about the phraseological options available to mark a specific rhetorical move.
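As a rough illustration of how a collocation list can be derived from corpus data, the sketch below applies NLTK’s bigram association measures to a hypothetical plain-text corpus. The built-in corpora and ranking methods of commercial writing tools are not public, so this is only an approximation of the general technique.

```python
# Requires: pip install nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical plain-text corpus
with open("corpus_sample.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                               # discard rare pairs
for pair in finder.nbest(measures.likelihood_ratio, 15):  # top collocation candidates
    print(" ".join(pair))
```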

The ultimate benefit of collecting large amounts of linguistic data is that it opens up immense possibilities for research and applications in areas at the intersection of Natural Language Processing and Artificial Intelligence. Because most large corpora nowadays can be easily compiled using web-scraping methods (see previous sections), computers can be trained to recognize linguistic patterns and predict new ones.
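The following sketch illustrates, under simplifying assumptions, the web-scraping route to corpus compilation: a page is downloaded, its markup is stripped, and the remaining text is appended to a growing corpus file. The URL and file names are placeholders, and, as noted above for DIY corpora, copyright and the site’s terms of use must be respected.

```python
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://example.org/article.html"      # placeholder URL
html = requests.get(url, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

# Append the cleaned text to a (hypothetical) growing corpus file
with open("web_corpus.txt", "a", encoding="utf-8") as out:
    out.write(text + "\n")
```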

Since this latter aspect is vast and requires clarifications that are beyond the scope of explaining how corpus linguistics contributes to writing research and applications in general, we exemplify such uses in the following two sections: corpus linguistics for writing studies and corpus related writing applications.

2 Corpus and Writing Research

2.1 Learner Corpora

The language produced by foreign or second language learners is called learner language (Gilquin & Granger, 2015, p. 418), and it is investigated within a branch of corpus linguistics named Learner Corpus Research. Research in this field has provided valuable insight into various learner language areas, such as grammar, lexis, phraseology, various discourse phenomena and pragmatics. Since English is the preferred language for “international research and global communication” (Flowerdew, 2015, p. 466) and, as a consequence, non-native novice writers are required to master English academic writing norms, learner corpus research covers many aspects of writing in English. The learner corpora investigating English for Academic Purposes (EAP) are of two types: (1) English for general academic purposes (EGAP) corpora and (2) English for specific academic purposes (ESAP) corpora. Corpora of the EGAP type contain writing common to multiple disciplines, such as argumentative essays on general topics, which, even if not discipline specific, help students “practise the same rhetorical functions found in disciplinary writing” (Flowerdew, 2015, p. 468). One such corpus is the International Corpus of Learner English (ICLE) (Granger et al., 2009), which consists of essays written by undergraduate students who are foreign or L2 learners of English from various L1 backgrounds.

ESAP corpora usually contain texts which are representative of “written disciplinary genres that tertiary students have to master” (Flowerdew, 2015, p. 467) and include sub-corpora divided by discipline or genre. Large-scale international corpus building initiatives exist, such as the Varieties of English for Specific Purposes (VESPA) corpus and the Corpus of Academic Learner English (CALE). Both corpora aim at collecting texts from multiple L1s, disciplines and genres (Flowerdew, 2015, p. 468). Other ESAP learner corpora have been compiled for a specific ESP/EAP context, and they usually consist of texts from certain L1 users or from certain disciplines or genres. A case in point is the Romanian Genre Corpus / ROGER (Chitez et al., 2021), a comparable bilingual corpus which comprises university writing by L1 Romanian students, in their mother tongue and in English as a Foreign Language.

2.2 Research, Teaching and Development

As explained at the beginning of this chapter, corpus linguistics has become an independent and multivalent discipline which has attracted the attention of many researchers. With a history of almost a century, corpus based research has migrated from the field of linguistics towards interdisciplinary areas centred around information technology. Corpus linguistics research is now performed at departments of modern languages and IT alike, with extensions towards Digital Humanities approaches. This multidisciplinary expansion has also been absorbed by teaching initiatives in all types of educational settings: pre-university language related approaches, university corpus based teaching and post-university further education programs. But the group that profits the most from the existence and improvement of corpus based writing research methods is the application and development group, represented by applied research departments at universities and the language related industry. It is now widely acknowledged that, by compiling linguistic datasets, practical tools and digital products can be developed that are intended to improve processes in all sectors involving linguistic analysis or language use, including writing. Numerous products (see Sect. 3) have been launched internationally and have billions of users.

3 Main Products

3.1 Corpora as a Basis for Primary Linguistic Tools for Writers

There are two main categories of corpus based primary linguistic tools for writers that have helped both expert and novice writers to foster their general or academic writing skills: dictionaries and phraseology databanks. The first category is fairly widespread, and it is the main language instrument that students, teachers and general language users consult in order to validate their linguistic choices or search for refined alternatives. The inventory created by Frankenberg-Garcia (2014) includes the textbook and dictionary series of the five major UK academic publishers: Cambridge, Collins, Longman, Macmillan and Oxford. All of them have produced language support resources that target the general language user (e.g. Cambridge Dictionary of American English), the grammar rule seeker (e.g. Cambridge Grammar of English), the L2 English language user (e.g. Collins COBUILD English Dictionary for Advanced Learners) or the writing-challenged user (e.g. Macmillan Collocations Dictionary). The Cambridge series is quite rich, with books from the following internationally used series: Cambridge Dictionary of American English, Cambridge International Dictionary of English, Cambridge Grammar of English, Cambridge Learner Corpus, the Touchstone series and the Vocabulary in Use series. They are based on the Cambridge English Corpus, which covers vocabulary across CEFR levels A1–C2. The Cambridge corpus based language aids are mainly used by those who want to write in a native-like manner.

Picture 1 Examples of phrases for ‘Compare and Contrast’ in the Academic Phrasebank

In the second category, a valuable academic writing resource that has corpus based research at its roots is the Academic Phrasebank (Morley, 2018), developed at the University of Manchester (Picture 1). The phrases have been ‘harvested’ (Morley, 2018, p. 4) from a corpus consisting of “100 postgraduate dissertations completed at the University of Manchester” while “phrases from academic articles drawn from a broad spectrum of disciplines have also been, and continue to be, incorporated” (p. 4).

3.2 Corpus Based Data Driven Learning

The use of linguistic corpora has not been limited to research in the field of corpus linguistics; instead, it has become an indispensable practice in all language related areas, such as translation studies, applied linguistics, sociolinguistics or language teaching. It has garnered the interest of researchers, teachers and students alike (Boulton & Tyne, 2013; Tribble, 2002). Corpus based teaching activities have been shown to have a positive effect on students’ linguistic competences, as their writing improves at multiple levels, for example with respect to lexico-grammatical features (Boulton & Tyne, 2013; Chitez & Bercuci, 2019; Cortes, 2007; Levchenko, 2017; O’Sullivan, 2010).

The study by Tatyana Karpenko-Seccombe (2020), Academic Writing with Corpora: A Resource Book for Data Driven Learning, introduces the latest corpus based resources suitable for teachers and students interested in language and writing improvement. Besides introducing various online corpora and several free-to-use tools, the book also provides practical examples of corpus based language acquisition improvements and shows the practicality of corpora in improving academic writing at both the micro (e.g. argumentative writing) and macro (e.g. writing a literature review) levels.

Most corpora in English can be used in classroom activities for teaching academic writing, whether as general/reference or specialized corpora. Many such resources are readily available online on websites such as https://www.english-corpora.org/, the largest and most frequented online collection of English-language corpora. For example, COCA (Corpus of Contemporary American English) has been used by Chang (2014), alongside a private specialized corpus (Michelangelo), to improve students’ writing in ESL (English as a Second Language). Likewise, the BNC (British National Corpus) and iWeb (on the platform formerly known as BYU Corpora) have been used productively by Khan (2019) to teach academic lexical bundles to ESL students. As far as specialized corpora are concerned, MICUSP (Michigan Corpus of Upper-Level Student Papers) has been used by Ädel (2010) to effectively introduce students to rhetorical moves in academic writing. Similarly, the ICLE corpus family (International Corpus of Learner English; Granger, 2003) has been used in numerous studies (e.g. McEnery et al., 2019) to analyse interlanguage phenomena or to extract potential learner error areas that can be exploited pedagogically. More recent academic writing databases are CROW (Corpus and Repository of Writing) (Staples & Dilger, 2018), containing US college writing samples, and ROGER (Corpus of Romanian Academic Genres) (Chitez et al., 2021), containing university students’ writing in Romanian as L1 and English as L2.

As many experts note, one of the most successful methods of integrating corpora in teaching academic writing has been to have students create their own specialized corpora (Chang, 2014; Cortes, 2007; Levchenko, 2017; Yoon, 2008). To this end, there are undeniable benefits to user-friendly software that can be used in corpus based teaching activities for academic writing classes, both by teachers and by their students. Standard corpus analysis tools are the free-to-use #LancsBox (Brezina et al., 2020) and AntConc (Anthony, 2022), the available-for-purchase WordSmith Tools (Scott, 2020) and many others mentioned on the webpage Tools for Corpus Linguistics.

3.3 The Use of Built-In Corpora in Writing Tools

3.3.1 Corpus Based Writing Improvement Tools

Corpus based writing improvement tools integrate searches specific to corpus linguistics into user-friendly, web-based platforms. In other words, users can perform linguistic searches in a variety of corpora hosted by the platform. Some of these platforms are commercial (e.g. Ludwig.guru) and others are developed in academic contexts (e.g. AWSuM). The commercial platforms address multiple audiences, such as scholars, students or professionals, whereas the academic tools target academically oriented audiences, such as students or researchers.

The target audience influences the corpus data contained by the platforms. Ludwig.guru, directed towards several audiences, hosts a variety of corpora divided into several categories based on register (e.g. News and Media, Science and Research or Formal and Business). In addition, users can create a corpus with their own linguistic data. By contrast, AWSuM, directed towards an academic audience, contains a corpus of academic writing, divided into two datasets made up of published research articles from two disciplines: Applied Linguistics and Computer Science. One important advantage of the AWSuM corpus is that it has been annotated for rhetorical moves (Atsushi, 2017).

Ludwig.guru provides several corpus based search features. Basic and complex free searches can be performed in the corpora hosted by the platform. The user can input a search word or phrase and explore its use in a variety of authentic language contexts (Picture 2). An example of a complex free search is the use of the wildcard “_”, through which the user gets synonym suggestions for a certain word in a phrase, as shown in Picture 3.

Picture 2 Concordance function in Ludwig.guru

Picture 3 Synonym search in Ludwig.guru

In addition, the frequencies of two words or two sentences can be compared (Picture 4). This can be useful when the writer is unsure about the structure of a multi-word unit or about which of several words with a similar meaning is preferred in a certain register (Charles, 2018, p. 20). Phraseological suggestions are also offered based on the user’s input, helping writers to diversify the language they are using.

Picture 4 Phraseological support in Ludwig.guru
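The underlying idea of such frequency comparisons can be sketched generically as follows: count how often each of two alternative phrasings occurs in a reference corpus and compare the totals. The corpus file and candidate phrases below are purely illustrative; Ludwig.guru’s own data and ranking methods are not public.

```python
def phrase_frequency(text, phrase):
    """Count (non-overlapping) occurrences of a phrase in lower-cased running text."""
    return text.count(phrase.lower())

# Hypothetical reference corpus and candidate phrasings
with open("corpus_sample.txt", encoding="utf-8") as f:
    text = f.read().lower()

for candidate in ("with regard to", "in regard to"):
    print(f"{candidate!r}: {phrase_frequency(text, candidate)} occurrences")
```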

3.3.2 Genre Writing Tutors That Use Built-In Corpora

Tools for genre writing pedagogy also use built-in corpora. These tools are mainly developed in academic contexts, with the aim of providing support for students when writing certain academic genres, such as a bachelor thesis or a research paper. Apart from various writing support functionalities, such as writing tutorials or phrase banks, certain tools of this type incorporate a specialized corpus that users can search via an integrated corpus search function.

Thesis Writer is a tool developed at the Zurich University of Applied Sciences in Switzerland which assists economics students in writing their bachelor or master thesis in either German or English. The platform integrates a discipline-specific, open-source economics corpus that can be explored via a keyword-in-context free search. Students can explore various authentic language excerpts containing the search term (Picture 5). Additionally, related collocations can be retrieved by using the tool’s “Associated words” feature.

Picture 5 Thesis Writer: examples of the word use feature

The Research Writing Tutor (Cotos, 2014), thoroughly described in “Automated Feedback on Writing”, also uses a specialized academic corpus. The multidisciplinary corpus, composed of “900 journal articles published in the top journals of 30 disciplines” (Cotos, 2017, p. 258), was manually annotated for rhetorical moves and steps. The moves were color-coded, and the steps were glossed. One module of the platform, entitled “Explore Published Writing”, gives access to the annotated texts and integrates a concordancer that can be used to search the corpus by move, step and discipline. In this way, users can get “examples of functional language indicative of the step’s rhetorical meaning” (Cotos et al., 2017, p. 110).

4 Future Developments

Although the modern writing research community is increasingly aware of the potential of corpus research and applications for writing, there are still aspects that could make the collaboration between the two communities more effective. At this stage, there is, on the one hand, the group of corpus linguists, who perform linguistic analyses of L1 and L2 phenomena that often include writing topics, and, on the other hand, the group of writing researchers, who are interested in pedagogical concepts of writing, the writing processes associated with them or the socio-cultural embedding of writing, and who sometimes include corpora in their investigations. The synergy between these two areas can be improved by creating networking opportunities (e.g. common conferences and dedicated sessions in existing conferences) and dissemination opportunities (e.g. dedicated journals) in which mixed methods are encouraged.

Moreover, the field of computational sciences has become essential for further developments. This means that, if valuable improvements are to be made in the use of corpora for writing studies and applications, IT specialists should be involved. Linguistics and writing departments should work more closely with IT departments within or outside the university. The same is valid for IT companies that develop writing apps: they should not ignore the importance of having linguists and writing specialists in their teams. This can make the difference between a general-use product with limited applicability and complex tools that address specific writing groups. It is also clear that corpus related Artificial Intelligence methodologies are the future of writing support technologies: ever larger amounts of linguistic data need automatic processing and evaluation, which can no longer be performed via traditional methods.

Last but not least, resources concerning corpora and writing can be made more systematic and visible, with clearer indications of how to use them. Particular attention should be paid to updating corpus and tool lists and recommendations for specific writing interest groups. At the moment, such resources are scattered across disparate locations, such as CLARIN (section: Language Resources [1]), the Corpus Resource Database (CoRD) [2] or the webpage Corpus-Analysis [3].

5 Tools

No | Tool / Software | Description of the tool and underlying technology | Reference | URL if available
1 | AntConc | Freeware corpus analysis toolkit; downloadable; versions for Windows, MacOS and Linux | Anthony (2022) | https://www.laurenceanthony.net/software
2 | AWSuM | Web-based writing assistant for academic writing support; annotated for rhetorical moves | Atsushi (2017) | https://langtest.jp/awsum/
3 | COCA | Web-based corpus platform; Corpus of Contemporary American English; free; log-in required | Davies (2009) | https://www.english-corpora.org/coca/
4 | CROW | Web-based corpus platform; repository of learner writing; free; log-in required | Staples and Dilger (2018) | https://crow.corporaproject.org
5 | English Corpora | Corpus overview portal; English language corpora | Davies (n.d.) | https://www.english-corpora.org
6 | ICLE | Corpus databank; International Corpus of Learner English; commercial product (CD/DVD) | Granger et al. (2009) | https://www.i6doc.com/en/book/?GCOI=28001105280390
7 | #LancsBox | Standalone software program for corpus analysis; downloadable; free | Brezina et al. (2020) | http://corpora.lancs.ac.uk/lancsbox
8 | Ludwig.guru | App and web-based interface (log-in required) for writing in English; sentence improvement options | Ludwig.guru (2022) | https://ludwig.guru/
9 | Manchester Academic Phrasebank | Academic phrasebank webpage; English academic phrase lists; free | Morley (2018) | https://www.phrasebank.manchester.ac.uk/
10 | Research Writing Tutor (RWT) | Annotated and pedagogically mediated multi-disciplinary corpus; concordancer for rhetorical functions | Cotos (2014) | NA (unavailable for external access)
11 | ROGER | Web-based corpus platform; bilingual academic writing corpus for English and Romanian; novice academic writing; multi-disciplinary and multi-genre; free; log-in required | Chitez et al. (2021) | https://roger-corpus.org/
12 | Sketch Engine | Corpus query and management system; commercial product (annual user licences) | Kilgarriff et al. (2014) | https://www.sketchengine.eu/
13 | Tools for Corpus Linguistics | Corpus tool portal; overview of corpus resources and their availability | Berberich and Kleiber (2020) | https://corpus-analysis.com/
14 | Thesis Writer | Online learning environment for bachelor or master theses in either German or English; offers various support functions (tutorials, phrasebook, corpus search, collaboration, feedback, project management…) | Rapp and Kauf (2018) | https://thesiswriter.zhaw.ch/
15 | WordSmith | Corpus analysis software; English language specific; commercial product (permanent user licences) | Scott (2020) | https://lexically.net/wordsmith/