
1 Overview

1.1 Introduction

Language use and writing strategies are two inseparable facets of the same process: knowledge creation and sharing. In order to produce valuable pieces of writing, either creative or scientific in nature, writers of all ages and competence levels are challenged with tasks that range from simple word selection (Cameron & Dempsey, 2013) to adapting their writing style to formal or informal registers (Reppen et al., 2002). The first inventories of words were dictionaries, structured archives of the contextual uses of linguistic items present in the language at a certain point in time; they also served as the first linguistic research outcomes. In this context, the emergence of corpora seemed like a natural methodological evolution in language sampling and research. When lexicographers started collecting data for corpus based dictionaries (Teubert, 2007), their purpose was not only to disambiguate vocabulary terms and their meanings but also to provide lexical options based on authentic language samples (Hanks, 2009).

Being collections of naturally occurring samples of language, corpora represent reliable guidance resources for writers of all disciplines, genres and purposes. Besides offering instant access to authentic usage for individual users, students and researchers, corpora also facilitate the creation of digital tools that support language use in writing and the writing process in general:

People are not generally aware that computational linguists use corpora to develop all sorts of language tools that have become commonplace in our everyday lives, from simple spell checkers, to auto-correct options in word processors and web browsers, to sophisticated machine translation programs. (Frankenberg-Garcia, 2014)

Besides basic challenges such as the choice of words, a frequently encountered problem in the process of writing is writer’s block, a phenomenon which is intrinsically cognitive (Hodges, 2017) but which can often be overcome through linguistic support. Such support can be automatic in nature, like paragraph generation (Duval et al., 2021), or can assist the lexical refinement process (Baker-Brodersen, 1988). Prompts of this kind are often based on corpora, and they are readily available online, provided that the user is aware of the limitations of corpus queries (Kaltenböck & Mehlmauer-Larcher, 2005).

1.2 Evolution of Corpus Linguistics

Nowadays, corpora represent collections of texts that are collected, processed, analysed and exploited with the help of computer technology. But corpora have not always been digital and, as the name corpus implies, i.e. 'body' of language in Latin (Bondi, 2017, p. 46), they existed even before the advent of technology, when linguists used pre-computer corpora as a base for their linguistic studies (Biber & Reppen, 2015, p. 2). For example, when writing the Dictionary of the English Language, published in 1755, Samuel Johnson used around 150,000 natural sentences written on slips of paper to show the natural use of words (p. 2). Up to the 1960s, other noteworthy works include dictionaries, e.g. The Oxford English Dictionary published in 1928, empirical vocabulary studies, such as the General Service List (West, 1953), and grammar studies, such as the two American English corpus based grammars by C. C. Fries published in 1940 and 1952 (Biber & Reppen, 2015, pp. 2–3) (Fig. 1).

Fig. 1 History of corpus linguistics beginnings (timeline from the Dictionary of the English Language to the Frown corpus)

An important change occurred in the 1980s, when large electronic corpora became widely available and computational tools started to be used to perform linguistic analyses on them (Biber & Reppen, 2015, p. 3). This gave rise to a flurry of linguistic studies using electronic corpora, focusing on features ranging from lexis and grammar to register variation (pp. 3–4).

The two milestone corpora, Brown (American English) and LOB (the Lancaster-Oslo/Bergen Corpus of British English), have been paralleled by later versions, Frown (Freiburg-Brown Corpus of American English) and FLOB (Freiburg-LOB Corpus of British English), initiated by Christian Mair at the University of Freiburg in Germany in 1991. The linguistic data in the later versions were meant to reflect the development of the language from the time of the initial corpora (the 1960s) to that of the new ones (the 1990s).

Since then, continuous advances in technology have enabled the use of electronic corpora and corpus tools on a very large scale. Corpus linguistics is now a well-established discipline, and its data analysis methods contribute to investigating language from various perspectives, related to topics such as registers, dialects or entire languages (Egbert et al., 2020, p. 3).

1.3 Core Idea of the Technology

Corpora can be broadly defined as machine-readable collections of texts, compiled according to specific criteria, that are analysed with the help of computer software, as they are too large to undergo manual analysis (McEnery & Hardie, 2012, pp. 1–2). The way a corpus is built is of great importance, because a corpus should ideally “represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair, 2005).

Web-based corpora, mainly composed of web pages, are the largest corpora available, containing billions of words. For example, the filtered version of the Common Crawl used in the pre-training dataset for GPT-3 (Generative Pre-trained Transformer 3) consists of 410 billion tokens. This large quantity of data enables powerful quantitative analyses. In addition, it is now possible to apply corpus linguistics methods to less structured language repositories, such as text archives (e.g. Lexis-Nexis, Google Books), or even the entire web. Common search interfaces allow basic queries that can yield linguistically relevant results. More powerful, however, are the so-called corpus architectures, which enable the more complex queries usually found in corpus linguistics tools. Examples of corpus architectures include the googlebooks.byu.edu interface, which uses n-grams extracted from Google Books, and the web-based tool Sketch Engine, which, alongside a variety of other corpora, hosts several enormous web-based corpora that can be searched using all the tool’s features (Davies, 2015, pp. 19–22).
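To make the notion of an n-gram concrete, the following minimal Python sketch counts word bigrams in a plain-text file. The file name is hypothetical, and the snippet only illustrates the underlying data structure behind n-gram interfaces, not how any particular platform is actually implemented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over successive n-grams (tuples of n adjacent tokens)."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical plain-text corpus file
with open("corpus_sample.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
for gram, freq in bigram_counts.most_common(10):   # ten most frequent bigrams
    print(" ".join(gram), freq)
```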

While web-based corpora are very successful in representing the genres normally found on the web, e.g. newspaper articles, they cannot offer a comprehensive picture of other language varieties, such as fiction or spoken language. General-purpose genre-balanced corpora seem to be a good middle ground between size and representativeness. Corpora of this type contain sub-sections representative of several registers and are also considerably large, so that powerful statistical analyses are supported. Two famous genre-balanced corpora are the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). COCA, which currently contains more than one billion words, is representative of eight registers, including academic texts, speech and fiction. New data is continuously added to the corpus in a controlled manner, with much attention dedicated to preserving the genre balance in each subsection.

Even so, in certain cases, the language domain studied is composed of texts that are not found in general-purpose corpora. This therefore requires the use of a specialized corpus, a type of corpus that represents, as far as possible, the full range of linguistic variation in a specific variety of language (Clancy, 2010, p. 82). The representativeness of the corpus is more important than its size, since it has been shown that a well-designed small specialized corpus can provide more relevant results regarding “specialized lexis and structures” (O’Keeffe et al., 2007, p. 198) than a large corpus that was not customized to meet the researcher’s needs (Nesi, 2012, p. 408).

Yet, there are situations in which no ready-made corpus can meet the needs of a specific research question, and, in these cases, scholars need to compile new corpora. Sometimes called DIY corpora, these corpora are compiled in basic formats, e.g. txt files, and are smaller than ready-made specialized corpora, but because they contain only the language variety under investigation, their analysis yields valuable results (Nesi, 2012, p. 408). However, most of the time DIY corpora remain private due to copyright laws.
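As a minimal sketch of what working with such a DIY corpus can look like, assuming a private folder of plain-text (.txt) files representing the language variety under investigation, the following Python snippet compiles a first frequency profile of the collection. All file and folder names are hypothetical.

```python
from collections import Counter
from pathlib import Path
import re

corpus_dir = Path("my_diy_corpus")        # hypothetical folder of .txt files
word_freq = Counter()
n_texts = 0

for txt_file in sorted(corpus_dir.glob("*.txt")):
    text = txt_file.read_text(encoding="utf-8").lower()
    word_freq.update(re.findall(r"[a-z']+", text))   # crude tokenisation
    n_texts += 1

print(f"{n_texts} texts, {sum(word_freq.values())} running words")
print(word_freq.most_common(20))                     # a first frequency profile
```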

1.4 Processing and Tools

In order to apply corpus linguistics methods to a data set, several steps need to be taken. The corpus is first compiled, then the corpus data is annotated, and, finally, the corpus is analysed using corpus linguistics software (Rayson, 2015). Annotation is a procedure which “allows the researcher to encode linguistic information present in the corpus for later retrieval or extraction” (Rayson, 2015, p. 38). Certain types of annotation can be done automatically, while others are done manually. Automatic annotation with a high degree of accuracy has already been achieved for English (and other major languages) at the levels of “morphology (prefix and suffix), lexical (part-of-speech and lemma), syntax (parsing)” (Rayson, 2015, p. 39) and, in many cases, semantics (semantic field). However, one downside of automatic annotation is that it is not accurate enough for every language. Manual annotation, on the other hand, is used for areas not supported by automatic annotation, e.g. discourse (Rayson, 2015).
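As an illustration of what automatic annotation at these levels can look like in practice, the sketch below uses spaCy, one of several freely available NLP libraries (the chapter itself does not prescribe any particular tool), to tag a sentence with part-of-speech, lemma and dependency (parsing) information.

```python
# Requires: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The corpora were annotated automatically before analysis.")

for token in doc:
    # surface form, lemma, part of speech, and dependency (parse) information
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} "
          f"{token.dep_:10} head={token.head.text}")
```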

After having been compiled and annotated, the corpus can be searched using software tools for corpus analysis. These tools can be standalone software installed on one’s computer, e.g. WordSmith, AntConc or #LancsBox. One important goal of these tools is to be user-friendly; however, they still involve a learning curve, and this may discourage non-corpus linguists from using them.
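The central display of such tools is the concordance (keyword-in-context, KWIC) view. The following minimal Python sketch reproduces the idea on a hypothetical corpus file; it is only an approximation of what dedicated tools such as AntConc offer.

```python
def kwic(tokens, keyword, window=5):
    """Print every occurrence of keyword with `window` tokens of left/right context."""
    keyword = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

# Hypothetical corpus file
with open("corpus_sample.txt", encoding="utf-8") as f:
    kwic(f.read().split(), "however")
```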

1.5 Functional Specifications

As previously mentioned, the use of corpora means, first and foremost, access to authentic language samples. Such access can be gained unsystematically, via large search engines (e.g. Google), or in a more structured manner, through dedicated corpus search platforms (e.g. the COCA corpus platform). Nevertheless, users should consider the following shortcomings of both access routes: first, unsystematic databases contain linguistic information which is unfiltered, and neither properly verified nor structured; second, corpus platforms, while including linguistic information that has been collected according to specific representativeness criteria, are quite frequently not open source (i.e. licence-based).

Writing-focused tools are built and designed to integrate large amounts of linguistic information with the purpose of extracting statistically validated language patterns and offering context-specific solutions. For example, such instruments can perform instant searches in their built-in corpora, select multiple word associations and generate best-matching collocation lists. This can benefit writers who are uncertain about the grammatical construction of a linguistic cluster, about the lexical choice within a structure or about the phraseological options available to mark a specific rhetorical move.
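As a rough illustration of how a collocation list can be derived from corpus data, the sketch below applies NLTK’s bigram association measures to a hypothetical plain-text corpus. The built-in corpora and ranking methods of commercial writing tools are not public, so this is only an approximation of the general technique.

```python
# Requires: pip install nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical plain-text corpus
with open("corpus_sample.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                               # discard rare pairs
for pair in finder.nbest(measures.likelihood_ratio, 15):  # top collocation candidates
    print(" ".join(pair))
```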

The ultimate benefit of collecting large amounts of linguistic data is that it opens up immense possibilities for research and applications in areas at the intersection of Natural Language Processing and Artificial Intelligence. Because most large corpora nowadays can be easily compiled using web-scraping methods (see previous sections), computers can be trained to recognize linguistic patterns and predict new ones.
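The following sketch illustrates, under simplifying assumptions, the web-scraping route to corpus compilation: a page is downloaded, its markup is stripped, and the remaining text is appended to a growing corpus file. The URL and file names are placeholders, and, as noted above for DIY corpora, copyright and the site’s terms of use must be respected.

```python
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://example.org/article.html"      # placeholder URL
html = requests.get(url, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

# Append the cleaned text to a (hypothetical) growing corpus file
with open("web_corpus.txt", "a", encoding="utf-8") as out:
    out.write(text + "\n")
```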

Since this latter aspect is vast and requires clarifications that are beyond the scope of explaining how corpus linguistics contributes to writing research and applications in general, we exemplify such uses in the following two sections: corpus linguistics for writing studies and corpus related writing applications.

2 Corpus and Writing Research

2.1 Learner Corpora

The language produced by foreign or second language learners is called learner language (Gilquin & Granger, 2015, p. 418), and it is investigated within a branch of corpus linguistics named Learner Corpus Research. Research in this field has provided valuable insight into various learner language areas, such as grammar, lexis, phraseology, various discourse phenomena and pragmatics. Since English is the preferred language for “international research and global communication” (Flowerdew, 2015, p. 466) and, as a consequence, non-native novice writers are required to master English academic writing norms, learner corpus research covers many aspects of writing in English. The learner corpora investigating English for Academic Purposes (EAP) are of two types: (1) English for general academic purposes (EGAP) corpora and (2) English for specific academic purposes (ESAP) corpora. Corpora of the EGAP type contain writing common to multiple disciplines, such as argumentative essays on general topics, which, even if not discipline specific, help students “practise the same rhetorical functions found in disciplinary writing” (Flowerdew, 2015, p. 468). One such corpus is the International Corpus of Learner English (ICLE) (Granger et al., 2009), which consists of essays written by undergraduate students who are foreign or L2 learners of English from various L1 backgrounds.

ESAP corpora usually contain texts which are representative of “written disciplinary genres that tertiary students have to master” (Flowerdew, 2015, p. 467) and include sub-corpora divided by discipline or genre. Large-scale international corpus building initiatives exist, such as the Varieties of English for Specific Purposes (VESPA) corpus and the Corpus of Academic Learner English (CALE). Both corpora aim at collecting texts from multiple L1s, disciplines and genres (Flowerdew, 2015, p. 468). Other ESAP learner corpora have been compiled for a specific ESP/EAP context, and they usually consist of texts from certain L1 users or from certain disciplines or genres. A case in point is the Romanian Genre Corpus / ROGER (Chitez et al., 2021), a comparable bilingual corpus which comprises university writing by L1 Romanian students, in their mother tongue and in English as a Foreign Language.

2.2 Research, Teaching and Development

As explained at the beginning of this chapter, corpus linguistics has become an independent and multivalent discipline which has attracted the attention of many researchers. With a history of almost a century, corpus based research has migrated from the field of linguistics towards interdisciplinary areas centred around information technology. Corpus linguistics research is now performed at departments of modern languages and IT alike, with extensions towards Digital Humanities approaches. This multidisciplinary expansion has also been absorbed by teaching initiatives in all types of educational settings: pre-university language related approaches, university corpus based teaching and post-university further education programs. But the group that profits the most from the existence and improvement of corpus based writing research methods is the application and development group, represented by applied research departments at universities and the language related industry. It is now widely acknowledged that, by compiling linguistic datasets, practical tools and digital products can be developed that are intended to improve processes in all sectors involving linguistic analysis or language use, including writing. Numerous products (see Sect. 3) have been launched internationally and have billions of users.

3 Main Products

3.1 Corpora as a Basis for Primary Linguistic Tools for Writers

There are two main categories of corpus based primary linguistic tools for writers that have helped both expert and novice writers to foster their general or academic writing skills: dictionaries and phraseology databanks. The first category is fairly widespread, and it is the main language instrument that students, teachers and general language users consult in order to validate their linguistic choices or search for refined alternatives. The inventory created by Frankenberg-Garcia (2014) includes the textbook and dictionary series of the five major UK academic publishers: Cambridge, Collins, Longman, Macmillan and Oxford. All of them have produced language support resources that target the general language user (e.g. Cambridge Dictionary of American English), the grammar rule seeker (e.g. Cambridge Grammar of English), the L2 English language user (e.g. Collins COBUILD English Dictionary for Advanced Learners) or the writing-challenged user (e.g. Macmillan Collocations Dictionary). The Cambridge series is quite rich, with books from the following internationally used series: Cambridge Dictionary of American English, Cambridge International Dictionary of English, Cambridge Grammar of English, Cambridge Learner Corpus, the Touchstone series and the Vocabulary in Use series. They are based on the Cambridge English Corpus, which covers vocabulary across CEFR levels A1–C2. The Cambridge corpus based language aids are mainly used by those who want to write in a native-like manner.

Picture 1 Examples of phrases for ‘Compare and Contrast’ in the Academic Phrasebank

In the second category, a valuable academic writing resource that has corpus based research at its roots is the Academic Phrasebank (Morley, 2018), developed at the University of Manchester (Picture 1). The phrases have been ‘harvested’ (Morley, 2018, p. 4) from a corpus consisting of “100 postgraduate dissertations completed at the University of Manchester” while “phrases from academic articles drawn from a broad spectrum of disciplines have also been, and continue to be, incorporated” (p. 4).

3.2 Corpus Based Data Driven Learning

The use of linguistic corpora has not been limited to research in the field of corpus linguistics; instead, it has become an indispensable practice in all language related areas, such as translation studies, applied linguistics, sociolinguistics or language teaching. It has garnered the interest of researchers, teachers and students alike (Boulton & Tyne, 2013; Tribble, 2002). Corpus based teaching activities have been shown to have a positive effect on students’ linguistic competences, as their writing improves at multiple levels, for example with respect to lexico-grammatical features (Boulton & Tyne, 2013; Chitez & Bercuci, 2019; Cortes, 2007; Levchenko, 2017; O’Sullivan, 2010).

The study by Tatyana Karpenko-Seccombe (2020), Academic Writing with Corpora: A Resource Book for Data Driven Learning, introduces the latest corpus based resources suitable for teachers and students interested in language and writing improvement. Besides introducing various online corpora and several free-to-use tools, the book also provides practical examples of corpus based language acquisition improvements and shows the practicality of corpora in improving academic writing at both the micro (e.g. argumentative writing) and macro (e.g. writing a literature review) levels.

Most corpora in English can be used in classroom activities for teaching academic writing, whether as general/reference or specialized corpora. Many such resources are readily available online on websites such as https://www.english-corpora.org/, the largest and most frequented online collection of English-language corpora. For example, COCA (Corpus of Contemporary American English) has been used by Chang (2014), alongside a private specialized corpus (Michelangelo), to improve students’ writing in ESL (English as a Second Language). Likewise, the BNC (British National Corpus) and iWeb (on the platform formerly known as BYU Corpora) have been used productively by Khan (2019) to teach academic lexical bundles to ESL students. As far as specialized corpora are concerned, MICUSP (Michigan Corpus of Upper-Level Student Papers) has been used by Ädel (2010) to effectively introduce students to rhetorical moves in academic writing. Similarly, the ICLE corpus family (International Corpus of Learner English; Granger, 2003) has been used in numerous studies (e.g. McEnery et al., 2019) to analyse interlanguage phenomena or to extract potential learner error areas that can be exploited pedagogically. More recent academic writing databases are CROW (Corpus and Repository of Writing) (Staples & Dilger, 2018), containing US college writing samples, and ROGER (Corpus of Romanian Academic Genres) (Chitez et al., 2021), containing university students’ writing in Romanian as L1 and English as L2.

As many experts note, one of the most successful methods of integrating corpora in teaching academic writing has been to have students create their own specialized corpora (Chang, 2014; Cortes, 2007; Levchenko, 2017; Yoon, 2008). To this end, there are undeniable benefits to user-friendly software that can be used in corpus based teaching activities for academic writing classes, both by teachers and by their students. Standard corpus analysis tools are the free-to-use #LancsBox (Brezina et al., 2020) and AntConc (Anthony, 2022), the available-for-purchase WordSmith Tools (Scott, 2020) and many others mentioned on the webpage Tools for Corpus Linguistics.

3.3 The Use of Built-In Corpora in Writing Tools

3.3.1 Corpus Based Writing Improvement Tools

Corpus based writing improvement tools integrate searches specific to corpus linguistics into user-friendly, web-based platforms. In other words, users can perform linguistic searches in a variety of corpora hosted by the platform. Some of these platforms are commercial (e.g. Ludwig.guru) and others are developed in academic contexts (e.g. AWSuM). The commercial platforms address multiple audiences, such as scholars, students or professionals, whereas the academic tools target academically oriented audiences, such as students or researchers.

The target audience influences the corpus data contained by the platforms. Ludwig.guru, directed towards several audiences, hosts a variety of corpora divided into several categories based on register (e.g. News and Media, Science and Research or Formal and Business). In addition, users can create a corpus with their own linguistic data. By contrast, AWSuM, directed towards an academic audience, contains a corpus of academic writing, divided into two datasets made up of published research articles from two disciplines: Applied Linguistics and Computer Science. One important advantage of the AWSuM corpus is that it has been annotated for rhetorical moves (Atsushi, 2017).

Ludwig.guru provides several corpus based search features. Basic and complex free searches can be performed in the corpora hosted by the platform. The user can input a search word or phrase and explore its use in a variety of authentic language contexts (Picture 2). An example of a complex free search is the use of the wildcard “_”, through which the user gets synonym suggestions for a certain word in a phrase, as shown in Picture 3.

Picture 2 Concordance function in Ludwig.guru

Picture 3 Synonym search in Ludwig.guru

In addition, the frequencies of two words or two sentences can be compared (Picture 4). This can be useful when the writer is unsure about the structure of a multi-word unit or about which of several words with a similar meaning is preferred in a certain register (Charles, 2018, p. 20). Phraseological suggestions are also offered based on the user’s input, helping writers to diversify the language they are using.

Picture 4 Phraseological support in Ludwig.guru
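The underlying idea of such frequency comparisons can be sketched generically as follows: count how often each of two alternative phrasings occurs in a reference corpus and compare the totals. The corpus file and candidate phrases below are purely illustrative; Ludwig.guru’s own data and ranking methods are not public.

```python
def phrase_frequency(text, phrase):
    """Count (non-overlapping) occurrences of a phrase in lower-cased running text."""
    return text.count(phrase.lower())

# Hypothetical reference corpus and candidate phrasings
with open("corpus_sample.txt", encoding="utf-8") as f:
    text = f.read().lower()

for candidate in ("with regard to", "in regard to"):
    print(f"{candidate!r}: {phrase_frequency(text, candidate)} occurrences")
```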

3.3.2 Genre Writing Tutors That Use Built-In Corpora

Tools for genre writing pedagogy also use built-in corpora. These tools are mainly developed in academic contexts, with the aim of providing support for students when writing certain academic genres, such as a bachelor thesis or a research paper. Apart from various writing support functionalities, such as writing tutorials or phrase banks, certain tools of this type incorporate a specialized corpus that users can search via an integrated corpus search function.

Thesis Writer is a tool developed at the Zurich University of Applied Sciences in Switzerland which assists economics students in writing their bachelor or master thesis in either German or English. The platform integrates a discipline-specific, open-source economics corpus that can be explored via a keyword-in-context free search. Students can explore various authentic language excerpts containing the search term (Picture 5). Additionally, related collocations can be retrieved by using the tool’s “Associated words” feature.

Picture 5 Thesis Writer: examples of the word use feature

The Research Writing Tutor (Cotos, 2014), thoroughly described in “Automated Feedback on Writing”, also uses a specialized academic corpus. The multidisciplinary corpus, composed of “900 journal articles published in the top journals of 30 disciplines” (Cotos, 2017, p. 258), was manually annotated for rhetorical moves and steps. The moves were color-coded, and the steps were glossed. One module of the platform, entitled “Explore Published Writing”, gives access to the annotated texts and integrates a concordancer that can be used to search the corpus by move, step and discipline. In this way, users can get “examples of functional language indicative of the step’s rhetorical meaning” (Cotos et al., 2017, p. 110).

4 Future Developments

Although the modern writing research community is increasingly aware of the potential of corpus research and applications for writing, there are still aspects that could make the collaboration between the two communities more effective. At this stage, there is, on the one hand, the group of corpus linguists, who perform linguistic analyses of L1 and L2 phenomena that often include writing topics, and, on the other hand, the group of writing researchers, who are interested in pedagogical concepts of writing, the writing processes associated with them or the socio-cultural embedding of writing, and who sometimes include corpora in their investigations. The synergy between these two areas can be improved by creating networking opportunities (e.g. common conferences and dedicated sessions in existing conferences) and dissemination opportunities (e.g. dedicated journals) in which mixed methods are encouraged.

Moreover, the field of computational sciences has become essential for further developments. This means that, if valuable improvements are to be made in the use of corpora for writing studies and applications, IT specialists should be involved. Linguistics and writing departments should work more closely with IT departments within or outside the university. The same is valid for IT companies that develop writing apps: they should not ignore the importance of having linguists and writing specialists in their teams. This can make the difference between a general-use product with limited applicability and complex tools that address specific writing groups. It is also clear that corpus related Artificial Intelligence methodologies are the future of writing support technologies: ever larger amounts of linguistic data need automatic processing and evaluation, which can no longer be performed via traditional methods.

Last but not least, resources concerning corpora and writing can be made more systematic and visible, with clearer indications of how to use them. Particular attention should be paid to updating corpus and tool lists and recommendations for specific writing interest groups. At the moment, such resources are scattered across disparate locations, such as CLARIN (section: Language Resources [1]), the Corpus Resource Database (CoRD) [2] or the webpage Corpus-Analysis [3].

5 Tools

No | Tool / Software | Description of the tool and underlying technology | Reference | URL if available
1 | AntConc | Freeware corpus analysis toolkit; downloadable; versions for Windows, MacOS and Linux | Anthony (2022) | https://www.laurenceanthony.net/software
2 | AWSuM | Web-based writing assistant for academic writing support; annotated for rhetorical moves | Atsushi (2017) | https://langtest.jp/awsum/
3 | COCA | Web-based corpus platform; Corpus of Contemporary American English; free; log-in required | Davies (2009) | https://www.english-corpora.org/coca/
4 | CROW | Web-based corpus platform; repository of learner writing; free; log-in required | Staples and Dilger (2018) | https://crow.corporaproject.org
5 | English Corpora | Corpus overview portal; English language corpora | Davies (n.d.) | https://www.english-corpora.org
6 | ICLE | Corpus databank; International Corpus of Learner English; commercial product (CD/DVD) | Granger et al. (2009) | https://www.i6doc.com/en/book/?GCOI=28001105280390
7 | #LancsBox | Standalone software program for corpus analysis; downloadable; free | Brezina et al. (2020) | http://corpora.lancs.ac.uk/lancsbox
8 | Ludwig.guru | App and web-based interface (log-in required) for writing in English; sentence improvement options | Ludwig.guru (2022) | https://ludwig.guru/
9 | Manchester Academic Phrasebank | Academic phrasebank webpage; English academic phrase lists; free | Morley (2018) | https://www.phrasebank.manchester.ac.uk/
10 | Research Writing Tutor (RWT) | Annotated and pedagogically mediated multi-disciplinary corpus; concordancer for rhetorical functions | Cotos (2014) | NA (unavailable for external access)
11 | ROGER | Web-based corpus platform; bilingual academic writing corpus for English and Romanian; novice academic writing; multi-disciplinary and multi-genre; free; log-in required | Chitez et al. (2021) | https://roger-corpus.org/
12 | Sketch Engine | Corpus query and management system; commercial product (annual user licences) | Kilgarriff et al. (2014) | https://www.sketchengine.eu/
13 | Tools for Corpus Linguistics | Corpus tool portal; overview of corpus resources and their availability | Berberich and Kleiber (2020) | https://corpus-analysis.com/
14 | Thesis Writer | Online learning environment for bachelor or master theses in either German or English; offers various support functions (tutorials, phrasebook, corpus search, collaboration, feedback, project management…) | Rapp and Kauf (2018) | https://thesiswriter.zhaw.ch/
15 | WordSmith | Corpus analysis software; English language specific; commercial product (permanent user licences) | Scott (2020) | https://lexically.net/wordsmith/