1 Introduction

In approaching a collection of texts, it is natural to ask: what kinds of texts does it contain? Attempts to categorize texts by their genre go back to Aristotle (Santini et al. 2010). Detecting the genre of a text is beneficial in many areas of Natural Language Processing. In POS tagging and discourse annotation, for example, knowing the genre of a document can help in selecting appropriate language models. Giesbrecht and Evert (2009) showed the impact of genre on POS tagging performance: their POS tagger achieves 96.9 % accuracy on newspaper texts but only 85.7 % on forums. Webber (2009) showed that genres such as letters to the editor and newspaper articles differ in the distribution of discourse relations. Genre detection for web texts can also be helpful in information retrieval: Vidulin et al. (2007) make the point that it is difficult for search engine users starting from standard topical queries to find relevant pages that are also in the right genres.

Realizing this need for genre annotation, even the Brown Corpus, the very first large computerized corpus created in the 1960s, was based on a classification of texts into 15 categories roughly corresponding to genres, such as Press:Reportage, Press:Editorial, Fiction:Adventure, or Fiction:Love and Romance (Kučera and Francis 1967). The British National Corpus (BNC) contains a classification of texts according to a range of genre-related parameters, such as the type of publication (e.g., book or newspaper) and the audience (specialists or lay persons), as well as an explicit genre classification designed by Lee (2001). With the arrival of the web, it became much easier to collect large corpora. The web also gave rise to new genres not available before, such as blogs or Internet shopping sites. However, many genres which feel unique to the web have earlier precursors: for example, one could argue that (personal) blogs have similarities to published personal diaries. Section 2 will review more closely the concept of genre and the relations between web and traditional genres.

The interest in the web and its genres (Mehler et al. 2010) resulted in a proliferation of genre-annotated web corpora, each of which was built according to its own principles, classification scheme and annotation guidelines. Problematically, these corpora either were not tested for annotation reliability, as the focus of the work lay elsewhere, or exhibit low inter-annotator agreement. Rehm et al. (2008) already call for a reliably annotated web genre corpus, preferably based on a random snapshot of the web, but do not present an actual corpus. This paper takes steps to remedy this research gap. After reviewing prior web genre corpora in Sect. 3, we summarize their shortcomings: these include reliability problems, the provision of only few pages for many genre classes, and an occasional lack of source and topic diversity and of appropriate storage formats. We suggest that crowd-sourcing is the appropriate method to develop a web genre corpus with high inter-annotator reliability because it allows speedy, accurate and inexpensive genre annotation that detaches the annotation proper from the potential bias of the expert team who developed the guidelines [see also Riezler (2014) for a discussion of the potential circularity that arises if the same team develops the guidelines and annotates].

We then present the Leeds Web Genre Corpus (LWGC), which identifies 15 genre classes reliably via crowd-sourcing. Our genre inventory is detailed in Sect. 5. The LWGC consists of two sub-corpora. The first one, LWGC-B(alanced), is a designed corpus, where web pages were collected via focused search for specific genres by following links in available web directories before being submitted to crowd-sourcing annotation. This method allows us to test our annotation method on a set of web pages with little noise. In addition, it leads to a balanced distribution of genres in the corpus, which is ideal for automatic genre identification via machine learning methods that need sufficient training material for each genre—a property that many existing collections lack. We also collect the corpus from a wide variety of sources, circumventing spurious topic-genre correlations present in some prior corpora. The LWGC-B(alanced) is described in Sect. 6. Our second sub-corpus, LWGC-R(andom), then successfully extends our method to a corpus where the pages to be annotated are collected in a more arbitrary way from web pages returned by search engines. The LWGC-R(andom) corpus is described in Sect. 7. This sub-corpus also allowed us to investigate and expand the coverage of the underlying genre inventory. However, the emphasis of our paper is not on completeness of the genre inventory but on genre annotation methodology.

Our main contribution is therefore the development of a crowd-sourcing genre annotation method which leads to the first web genre corpus with all of the following properties: demonstrably high inter-annotator agreement, regardless of web page provenance, and achievable by non-expert annotators; a large number of web pages per category; source and topic diversity.

2 The concept of genre

Genre definitions. Many researchers have studied the notion of genre, mostly concentrating on the role that the form and the function of a document play in defining genre. As an example, Campbell and Jamieson (1978) defined genre as:

a group of acts unified by a constellation of forms that recurs in each of its members. These forms, in isolation, appear in other discourses. What is distinctive about the acts in genre is a recurrence of the forms together in constellation. (Campbell and Jamieson 1978, p. 20)

In this definition, the emphasis is on a document’s form. In contrast, Miller (1984, p. 159) argues that the definition of genre must not be limited to the form of the discourse only, but it should also include “the action it is used to accomplish”. In other words, texts in a genre class have the same purpose or function as well as similar patterns of form. Biber (1991) also emphasizes the importance of purpose in recognizing a genre class by stating:

I use the term genre to refer to text categorizations made on the basis of external criteria relating to author/speaker purpose.  (Biber 1991, p. 68)

Swales's (1990) definition of genre is in line with Biber's, as he also recognizes "purpose" as the principal attribute that instances of a genre class share.

We follow Orlikowski and Yates (1994) who use a more comprehensive definition of genre, which combines function and form:

a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form.  (Orlikowski and Yates 1994, p. 543)

Orlikowski and Yates’s (1994) definition also adds a new dimension by clearly stressing that genres must be socially recognizable. In other words, genre classes exist only if they are identifiable by people in society (Andersen 2008).

Web genres and traditional genres. Since this paper focuses on genres on the web, it is important to compare web genres with genres in traditional media. The World Wide Web, which was created in 1989, is a communication medium for retrieving and displaying multimedia hypertext documents (Berners-Lee et al. 1994).

Yates et al. (1997) recognized the advent of a new communications medium as one of the reasons for the emergence of variants of existing genres or of new genres. Shepherd and Watters (1998) introduced the notion of cybergenre and proposed a hierarchical taxonomy for classifying the genres on the web compared to traditional genres. According to this classification, cybergenres can be extant (i.e. "based on existing genres") or novel (i.e. "not like any existing genre in any other medium"). They give on-line newspapers as an example of extant genres and personal homepages as an example of novel genres.Footnote 1 Extant genres are divided into two sub-classes: replicated (i.e. "based on genres existing in other media") and variant (i.e. "a modification of existing genres"). Novel genres are also separated into two groups: emergent (i.e. "derived but significantly different from existing genres") and spontaneous (i.e. "never employed in other media"). They refer to personalized newspapers and frequently asked questions as examples of emergent and spontaneous genres, respectively.

Crowston and Williams (2000) proposed a similar categorization for web genres. They conducted a survey of 1000 random web pages and distinguished four different types of genres: reproduced, adapted and novel genres as well as unclassified web pages (see Table 1). Reproduced genres replicate genres in traditional media to a great extent and were found to be the most frequent type (60.6 %). The second type, adapted genres, evolved from existing genres in the paper world by using the capabilities of the new medium. For example, a list of items that uses the hyperlink capability of the web to link to other pages functions both as a list and as an index. As a third type they note novel genres exclusive to the web, such as home pages. Although the proportion of novel genres in this study is very low, it is possible that this group of genres nowadays comprises a bigger percentage due to additional frequent genres such as microblogs. Pages remained unclassified for two main reasons: the annotators did not know the name of the genre, or they found it difficult to determine the purpose of the web page. Some of these unclassified web pages could be examples of genres still in formation. Therefore, in the process of building a genre-annotated web corpus, we would expect to find some web pages without any genre label.

Table 1 Percentage of types of genres found by Crowston and Williams (2000)

3 Existing genre-annotated web corpora

Several efforts have been made to build genre-annotated web corpora and to employ them for research in the field of automatic genre identification (AGI). However, the collections differ in corpus size, collection method and web page storage format. In addition, there is no agreed set of genre labels, so each collection's labels vary according to the researchers' priorities and the genre definition chosen (see also Sect. 2). In the following, we give a short description of each genre collection, after which we summarize some characteristics all of them share. Table 2 gives an overview of the properties of these corpora.

The hierarchical genre collection (HGC) (Stubbe and Ringlstetter 2007), the Syracuse corpus (Crowston et al. 2011), KRYS I (Berninger et al. 2008) and the corpus constructed in Egbert and Biber (2013), Egbert et al. (2015) use a relatively large number of genre labels (between 32 and 292), leading to high granularity. Their focus is therefore on high coverage and the construction of a detailed taxonomy. HGC, KRYS I and Egbert et al. (2013, 2015) use a hierarchical structure of genre labels, so that a more coarse-grained classification is also available.Footnote 2 All of them use labels influenced by both form and function of the document, although some labels relate only to document function or even to document medium, especially at the coarse-grained classification level. This is especially true for Egbert et al. (2013, 2015). More details on each corpus follow.

The hierarchical genre collection (HGC) (Stubbe and Ringlstetter 2007) was annotated using hierarchical genre labels with seven main categories and thirty-two sub categories, e.g., literature as a main category with the subcategories poem, prose and drama. This collection consists of 1280 web pages preserved in HTML format. For each genre category, forty prototypical pages were manually collected.

The KRYS I collection (Berninger et al. 2008) consists of 6200 PDF documents. This corpus has been annotated using seventy genres which are grouped into ten coarse classes, e.g. Commentary and Review in the Journalism group. Although this collection is meant to be a genre-annotated web corpus, it includes only web pages in PDF format. Therefore, genres that do not normally use this format, such as homepage and shop, are not included.

The Syracuse (Crowston et al. 2011) collection consists of 3027 web pages annotated based on 292 very specific genres. The genre palette in this collection was developed bottom-up by asking three groups of people (teachers, journalists, engineers) to produce web genre terms themselves.

The corpus constructed in Egbert and Biber (2013) has 1000 random web pages categorized into eight very broad, mainly functionally defined genres or registers (e.g. description, discussion and opinion) and 56 sub-registers (which use both form and function). This corpus was annotated via Amazon Mechanical Turk, a crowd-sourcing platform. The project was later extended to 53,000 web pages in Egbert et al. (2015). Their work is therefore the most similar to ours with regard to annotation methodology. However, they have a stronger focus on coverage, whereas we focus on annotation reproducibility, which is low in their work (see Table 2 and the further discussion of reliability below).

Then there is a group of corpora with smaller sets of genre labels, either because the researchers focus less on coverage and more on genres which are of interest to them for a certain application or task (KI-04, SANTINIS), or because the authors attempt to achieve high coverage via a broad set of often purely functional labels without further subdivision (I-EN-SAMPLE, and MGC to a degree). We discuss these next.

KI-04 (Meyer zu Eissen and Stein 2004) and SANTINIS (Santini 2007) are the corpora most often used in automatic genre identification work. Their categories are motivated by web search use and web specificity, respectively. KI-04 (Meyer zu Eissen and Stein 2004) contains 1205 HTML documents annotated using eight genres, e.g., link collection, shop and articles. The genre list in this collection focuses on genre classes that are most useful for web search—it was developed by asking a group of students to fill in a questionnaire about typical topics for queries and favourite genre classes. As can be seen, the resulting classes are of quite differing granularity. The SANTINIS corpus (Santini 2007), which consists of 1400 web pages, was annotated based on seven genres. This collection focused on genres which are exclusive to the web, e.g. blog and FAQs.Footnote 3 In the compilation of this corpus, only web pages which clearly belonged to these genres were manually collected.

The MGC corpus (Vidulin et al. 2007) is the only genre-annotated corpus which allows multi-labeling, i.e. a page can be categorized into several genre classes. It consists of 1536 web pages classified into twenty genres. Some of these genres are defined on purely functional criteria, such as commercial/promotional, whereas others use both form and function (e.g. FAQ). The corpus was collected by targeting web pages in these genres, as well as by using random web pages and popular web pages from Google Zeitgeist.

I-EN-Sample (Sharoff 2010) consists of 250 web pages randomly selected from the I-EN corpus, a snapshot of English Web texts from 2005 (Sharoff 2006). It was annotated using the Functional Genre Classification (FGC) scheme, which consists of seven macro-genres aimed at describing the genre of any text. The genre palette in FGC is based solely on the function or purpose of the document, e.g. discussion, which includes academic papers, forums, emails or political debates, or instruction, which covers FAQs, manuals and tutorials. This annotation scheme therefore differs from others by sacrificing depth and specificity for coverage.

Table 2 This table summarizes some characteristics of genre-annotated corpora

We are now going to discuss areas of research where we think that the current corpora, regardless of all their diversity, leave open questions and where we can address the corresponding research gap.

Reliability. None of the existing work demonstrates high reliability of their genre annotation via inter-annotator agreement or presents a clear annotation procedure that is then proven to lead to a reliably annotated corpus.

The reasons for this differ. Corpora such as SANTINIS, KI-04 and Syracuse have been annotated by a single person. As a result, inter-annotator agreement measures cannot be computed for them. Given that SANTINIS and KI-04 explicitly searched for prototypical examples of a small set of categories, it is possible that the annotation could be recreated by several annotators, but this cannot be assured and there are no publicly available guidelines to test. The MGC, I-EN-Sample and KRYS I corpora have been double-annotated. However, agreement was low (\(\alpha =0.56\) for MGC and \(\alpha =0.55\) for I-EN-Sample), as discussed in detail in Sharoff et al. (2010).Footnote 4 Table 3 shows the low percentage agreement for the KRYS I corpus; chance-corrected agreement tends to be even lower.

Table 3 Human agreement for the KRYS I corpus (Berninger et al. 2008) which has seventy genre classes

The corpus constructed in Egbert et al. (2013, 2015) is annotated via crowd-sourcing: four annotations were assigned to each web page via the crowd-sourcing website Amazon Mechanical Turk. However, reliability results are not high: for the eight main functional genres, at least three out of four annotators agree on only 63 % of the web pages; for the fine-grained genres, at least three out of four annotators agree on only 43 % of the web pages [see the pilot study in Egbert and Biber (2013)]. In Egbert et al. (2015), chance-corrected agreement is computed at a kappa of 0.47 and 0.40 for coarse- and fine-grained categories respectively, again showing only moderate agreement.

Overall, it is interesting that granularity is an insufficient explanation for low reliability results as in many corpora (coarse-grained categories in Egbert and Biber (2013), Egbert et al. (2015), I-EN Sample, MGC) reliability is low even for a relatively small (\({<}20\)) number of categories.

Corpus design and expert annotation. There are two other issues regarding annotation in current corpora.

Firstly, many of these corpora are designed, i.e. constructed by a focused search for pages that are likely to fit a given category.Footnote 5 This is advantageous for a first test of an annotation scheme as one avoids noisy pages or borderline cases. Learning from prototypical examples can also be good for training automatic genre identification algorithms. However, it is unclear how manual or automatic results transfer to arbitrary web pages. In fact, Sharoff et al. (2010) show that human agreement tends to be even lower for arbitrary web pages than for web pages collected by focused search. A similar point is made by Rehm et al. (2008), who propose a designed corpus as a first step, with a corpus of more randomly selected web pages as a second step. Unfortunately, the authors did not follow up this suggestion with a web genre corpus of their own. In this paper, we remedy this gap.

Secondly, expert annotation can mislead with regard to the general applicability of the annotation scheme, especially if the same experts developed the scheme and conducted the annotation (Riezler 2014). This was the case in SANTINIS, MGC and KI-04, for example. To avoid this problem, we use crowd-sourcing with a larger number of naive annotators who are distinct from the scheme developers. In contrast to Egbert et al. (2013, 2015), who also use crowd-sourcing, we do not focus on coverage but on reliability, so that these efforts are complementary. To the best of our knowledge, ours is therefore the first crowd-sourcing effort for genre annotation with demonstrably high inter-annotator agreement.

Size. Many existing collections are not large enough to ensure representativeness of genre classes. Table 2 shows the maximum, minimum and median number of web pages per genre category. As can be seen, they often have few annotated web pages per category, especially the KRYS I and Syracuse corpora, while machine learning algorithms often require a reasonable number of training examples in order to produce satisfactory results. A notable exception is Egbert et al. (2015): although it also contains many genres with few or no examples, 24 of the 56 genres used are represented by over 100 pages.

Format. Another major drawback of some existing corpora is that they have been preserved in non-HTML formats such as PDF or plain text. For instance, each web page in the KRYS I corpus is saved in PDF format. As a result, automated tools are needed to convert PDF to plain text or HTML. However, these tools are error-prone, so some information may be lost or wrongly converted. In addition, previous studies in AGI show that HTML tags can improve the accuracy of genre classification (Kanaris and Stamatatos 2009) and should therefore be kept when collecting web genre corpora.

Topic diversity. There are genres which have a natural, strong correlation with certain topics; for example, the genre label recipe has a clear connection to the topic label food. These correlations between genres and topics are true and explicit connections and will always exist. However, in some existing genre-annotated corpora, there are a number of correlations between genres and topics which are spurious in that they are due to the way the search for genre texts was conducted. For example, a large sample of the frequently asked questions texts in the SANTINIS corpus (Santini 2007) comes from web sites about hurricanes. Such spurious correlations can mislead investigations into typical genre properties—Petrenz and Webber (2011), for example, show that the often best-performing bag-of-words features in AGI perform considerably worse when topic is varied. AGI based on these features therefore potentially learns topics rather than genres.

As far as we know, there is no corpus construction approach that explicitly looks into the topic diversity of the resulting corpus. We propose a method for approaching this and discuss source and topic diversity explicitly.

4 Aims of this study: creating a reliable genre-annotated corpus via crowd-sourcing

Currently, there is no established web genre annotation method that results in demonstrably high inter-annotator agreement. We try to remedy this gap by building the Leeds Web Genre Corpus (LWGC), which fulfills the following criteria:

  • It is reliably annotated for genre as measured by chance-corrected agreement. Reliability has currently been established for 15 genre classes. We also discuss extensibility of our procedure to other genre classes in Sect. 7.5.

  • It avoids circularity by crowd-sourcing naive annotators who were not involved in the development of the annotation scheme (Riezler 2014).

  • Web pages have been saved in HTML format. In addition, the appearance of each web page has been preserved by taking a screen shot of its whole content. The latter can facilitate the use of visual features as well as textual and HTML features in AGI.

  • It contains a sub-corpus (LWGC-B) that used focused search to create a corpus with a substantial number of web pages for each individual genre category. LWGC-B has been collected from a diverse range of sources in order to avoid creating false correlations between genres and topics. We discuss an approach to measure topic diversity for genre corpora.

  • It also contains a sub-corpus (LWGC-R) that approximates random web page collection in order to test (1) the transferability of the developed annotation scheme to arbitrary web pages and (2) the coverage of the current inventory of genre classes.

5 Genre inventory

The quality of manual annotations depends on the use of precise and consistent guidelines which include category definitions. Therefore, the development of the annotation guidelines must be seen as one of the crucial tasks in annotation projects. Although the main focus of this work is not the development of a comprehensive genre taxonomy, we still need clearly defined categories that our naive annotators have a chance of annotating with little training.

We used several criteria that all our genre classes needed to fulfill.

Form and function. First, we want to use only genre classes and terms that include form constraints in addition to functional constraints. This is in line with the definition we outlined in Sect. 2, and also mirrors Kessler et al. (1997), who emphasize that a genre should not be so broad that the texts belonging to it do not share any distinguishing properties.

we would probably not use the term genre to describe merely the class of texts that have the objective of persuading someone to do something, since that class (which would include editorials, sermons, prayers, advertisements, and so forth) has no distinguishing formal properties.  (Kessler et al. 1997, p. 33)

Therefore, our genre inventory automatically excludes, for example, the broad register classes used in Egbert et al. (2013, 2015). We think it is quite possible that some of the broad, functional categories in previous annotation schemes led to low inter-coder agreement—examples are categories such as informative and entertainment in the MGC corpus (Vidulin et al. 2007) or the functional genre categories in I-EN-Sample (Sharoff 2010) and Egbert and Biber (2013). Defining broad genre categories could not only cause disagreement between annotators, but also have a negative impact on automatic genre classification.

Common usage. For naive annotators, we want to use genre names which they might have heard before and that are in common use (such as forum), and to avoid expert linguistic terminology while remaining specific. This is not just a choice of convenience but also mirrors the fact that genres should be socially recognizable, as postulated by the definition we give in Sect. 2.

Text orientation. As another constraint, we were interested in textual genres only and excluded all genres that are mainly visual or include little text (such as link lists, web pages with just a video or a series of pictures etc.).

Variety of different functions. Although our genre names and descriptions include both form and function, we want to include genres that cover a broad range of functions, or what Biber (1991) calls text type dimensions. Thus, we want genres from the narrative as well as the non-narrative spectrum and from the colloquial/spontaneous as well as the edited text spectrum.

Limited set. As this was our first study on genre annotation via naive users, we decided to start with a limited genre palette instead of a complete taxonomy. We therefore made a list of all previously used genre terms that fulfilled the criteria above, mapping equivalent terms as best we could, and chose a subset of 15 genres from a wide spectrum of form and function. We also focused on genres that, based on our own informal experience, we hypothesized to be frequent on the web, such as blogs, news articles or forums.

In addition, we also tried to narrow our definitions down as much as possible while staying with socially recognizable forms: this led, for example, to the inclusion of the genre recipe as distinct from other how-to instructions. We think that this actually allows the definition of other how-tos to be more precise. In Sect. 6.5 we will show that, in accordance with our intuition, the genre recipe is indeed distinct from other instructions with regard to length and type/token distributions.

Final set. Table 4 shows the set of 15 genre labels and their definitions. We are fully aware that our set of criteria could also lead to a different set of genres; however, this set will allow us to test crowd-sourcing for a wide variety of forms and functions and includes many web-typical genres, such as homepages and forums. Other approaches can use their own genre palettes as long as they fulfil the same criteria, and can then reasonably hope that a similarly designed crowd-sourcing effort will also lead to good annotation for them.

Table 4 Definition of genre labels in the LWGC

Table 5 shows how these 15 selected genre classes correspond to those used in other genre-annotated corpora. However, since different genre-annotated corpora used different genre classes with different levels of granularity, any one-to-one comparison between our genre labels and their genre classes can only be approximate. For example, the genre label journalistic in MGC can include several genres in our corpus such as news, editorial, interviews and reviews. Another example is the periodicals (newspaper, magazine) category from the KRYS I corpus which is very broad and can include many genre classes such as recipe, interview and reviews.

Table 5 This table illustrates which genre classes in our corpus are also included in existing genre-annotated corpora

The genre inventory in Table 4 applies to both sub-corpora of the LWGC. We explore the coverage of our scheme in Sect. 7.

6 LWGC-B: a web genre corpus designed via focused search

Web corpora are categorized into designed and random corpora according to their collection method (Kilgarriff 2012). The content of a designed corpus is selected based on its design specification, normally following a focused search method. In contrast, the content of a random corpus represents a (more or less faithful) snapshot of the web. HGC (Stubbe and Ringlstetter 2007) and UKWac (Baroni et al. 2009) are examples of designed and random corpora, respectively.

As explained in Sects. 1 and 3, we use a designed corpus as the first step for testing our annotation scheme and crowd-sourcing effort, for two reasons. First, we can provide a corpus with a large number of web pages for each category via this method. While collecting random web pages is fast and cheap, there is no guarantee that it fulfills this criterion. Second, manually collected, prototypical examples provide a good test bed for using naive annotators. If agreement cannot be established on the prototypical pages, it is unlikely to be achieved on random pages. It is also possible that prototypical examples are better for training machine learners. The use of a designed corpus was also suggested by Rehm et al. (2008) as an initial step when building a reference corpus of web genres.

On the flip side, a designed corpus will not give us an accurate representation of the actual genre distribution on the web nor will it tell us the coverage of our annotation scheme. Annotation results on clear and prototypical web pages are also likely to overestimate inter-annotator agreement (Sharoff 2010). We will investigate those issues in Sect. 7 where we collate and annotate a smaller, random corpus, the LWGC-R.

6.1 LWGC-B: corpus compilation

We hand-selected web pages mainly from existing web directories, particularly the Yahoo DirectoryFootnote 6 and Open Directory ProjectFootnote 7 websites. We selected 3964 web pages from a diverse range of sources to avoid creating false correlations between topic and genre labels. We will discuss the source and topic diversity of the corpus further in Sect. 6.6.

In the next phase, we used the KrdWrd tool (Steger and Stemle 2009) to download the web pages in HTML format. However, saving a web page in HTML format alone does not guarantee the preservation of its appearance. To achieve this, we could either save all of a web page's graphics and style files or take a screen shot of its whole content. We chose the second option and used KrdWrd to also preserve each web page as an image.

6.2 LWGC-B: annotation procedure

After collection, the corpus needs to be annotated with the set of chosen genre labels (see Sect. 5), which can be a very time-consuming and expensive task. However, in recent years, the advent of crowd-sourcing (e.g. via Amazon Mechanical TurkFootnote 8) has facilitated annotation tasks so that this phase can be done more cheaply and quickly than ever before. Amazon Mechanical Turk (MTurk) has been used for a variety of labelling and annotation tasks in Natural Language Processing, e.g. word sense disambiguation, word similarity, text alignment and temporal ordering (Snow et al. 2008), machine translation (Callison-Burch 2009) and building a question answering dataset (Kaisser et al. 2008). It has also been used for genre annotation by Egbert et al. (2013, 2015), but without establishing high inter-annotator agreement (see Sect. 3).

In addition to saving expense and time, we can ensure easy re-use of the annotation scheme if even naive annotators with short guidelines achieve high reliability. The fact that the annotators are independent of scheme developers also avoids circularity in annotation (Riezler 2014).

6.2.1 Amazon’s mechanical turk

The Mechanical Turk web site provides a service which enables requesters, such as researchers or companies, to create and publish jobs, known as Human Intelligence Tasks (HITs). These HITs can be carried out by untrained MTurk workers (turkers) all around the world for a small amount of money. The main advantages of MTurk are low cost and speedy task completion, as well as its infrastructure, which allows requesters to develop their HITs using standard HTML and Javascript.

With turkers, quality control is crucial in order to detect poor-quality or randomly selected answers. Moreover, MTurk HITs, like any other web-based interface, are vulnerable to automated scripts, also known as bots, which are used by some turkers in order to maximize their income (Mason and Suri 2012). We therefore used two types of qualification criteria in our HIT design, as provided by MTurk.

Firstly, MTurk provides “system qualifications,” which are independent of the specific task created. They include HIT submission rate (the percentage of accepted HITs eventually submitted by the turker), HIT approval rate (ratio of HITs approved by the requester compared to the total number of HITs submitted by the turker), HIT rejection rate (ratio of rejected HITs compared to the total number of HITs submitted by the turker) and location (the worker’s country of residence).

The second type of quality control measure is task-specific. It includes the possibility of a pre-task qualification test designed by the requester. Up to five qualification criteria can be assigned to a HIT by the requester, and only turkers who pass these qualification measures are permitted to complete the HITs. With regard to after-task quality control, MTurk enables requesters to download and (automatically or manually) review the submitted work, then reject poor-quality data and only pay for the HITs which they approve. In the next section, we describe both the system qualifications and the task-specific pre- and after-task quality controls that we use.

6.2.2 HIT design and quality control

This section describes the details of our HIT design and quality control measures.

HIT design. Turkers were presented with the list of our 15 genre categories together with short guidelines that allowed them to view the category definitions (see Table 4). They were also able to view example pages for the categories if they wished. As our genre inventory is not exhaustive, annotators were also allowed to choose the option other for web pages that do not fit any of the 15 classes. In order to keep the annotation task simple, we decided on single-labeling, i.e. each web page could only receive a single genre label, despite the fact that some web pages might belong to more than one genre class (Crowston and Kwasnik 2004; Kessler et al. 1997; Santini 2008). Annotators needed to click on a link to open the web page to be annotated—the cached web page would then open in a separate window. Figure 1 shows a screen-shot of the annotation task.

A single HIT includes 10 web pages to be annotated, both because this is more time- and cost-effective and because we use this grouping for quality control, as described below.

Fig. 1 Screen-shot of the genre annotation task on the MTurk website

Quality control. With regard to system qualifications, we restricted the range of workers who could complete our task. As we were looking for experienced workers, we only allowed workers who had previously completed at least fifty HITs successfully. To ensure diligence, we restricted the task to workers with an approval rate of 95 % or greater.

As a task-specific pre-task qualification test, we let turkers read the definitions and examples of the genre classes and then complete a trial HIT of ten genre annotations on pages that we deemed highly prototypical and that should therefore be annotatable without much scope for error. Only turkers who completed this qualification test with a score of at least 80 % were allowed to take part. This was intended to weed out bots and random clickers.

For after-task quality control without excessive manual work or substantial expert bias, we used one of the ten web pages to be annotated per HIT as a "trap" question. We selected a set of twenty web pages that the first author of this paper judged to be unambiguous and clear examples of one of our predefined genre categories and used these web pages as trap questions. We performed semi-automated monitoring of the annotations by checking the answers to the trap questions and rejected workers who did not answer the trap questions correctly at least 80 % of the time.
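To make this procedure concrete, the following is a minimal Python sketch of such semi-automated trap-question monitoring, under the assumption that the submitted annotations can be exported as (worker, page, label) records; the page identifiers and gold labels shown are purely illustrative and not the actual trap pages used in the study.

```python
from collections import defaultdict

# Gold labels for the trap pages; ids and genres here are placeholders,
# not the actual twenty pages selected by the first author.
TRAP_GOLD = {"trap_01": "recipe", "trap_02": "faq", "trap_03": "forum"}

def flag_unreliable_workers(assignments, min_trap_accuracy=0.8):
    """assignments: iterable of (worker_id, page_id, label) records as
    exported from the crowd-sourcing platform. Returns the workers whose
    accuracy on the trap pages falls below the threshold, i.e. whose
    HITs would be rejected."""
    seen, correct = defaultdict(int), defaultdict(int)
    for worker_id, page_id, label in assignments:
        if page_id in TRAP_GOLD:              # only trap pages are checked
            seen[worker_id] += 1
            correct[worker_id] += int(label == TRAP_GOLD[page_id])
    return {w for w in seen if correct[w] / seen[w] < min_trap_accuracy}
```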

Because adding more annotators can help to reduce annotation bias, it is encouraged in human annotation projects to have as many annotators as possible (Beigman Klebanov and Beigman 2009). We chose to have five annotations per web page: Snow et al. (2008) compared the quality of annotation done by experts and MTurk workers and concluded that an average of four turkers often provides expert-level label quality.

6.3 Inter-coder agreement measures

In Natural Language Processing and machine learning, a reliably annotated dataset plays a crucial role. The results of research based on unreliable annotation can be considered untrustworthy, doubtful or even meaningless. In order to measure the reliability of an annotation, different annotators judge the same data and inter-coder agreement is calculated for their judgments. The most commonly used inter-coder agreement measures are percentage agreement, S (Bennett et al. 1954), Scott's \(\pi \) (Scott 1955), Cohen's or Fleiss' \(\kappa \) (Cohen 1960; Fleiss 1971) and Krippendorff's \(\alpha \) (Krippendorff 1970) [see Artstein and Poesio (2008) for a comprehensive survey of inter-coder agreement measures].

Percentage or observed agreement is the simplest measure of agreement among coders. However, this measure does not take into account agreement which is expected to happen by chance. As a result, it can overestimate true agreement. Therefore, other inter-coder agreement measures which correct for chance agreement must be computed. Originally, these coefficients (such as Scott's \(\pi \) and Cohen's \(\kappa \)) were proposed for calculating inter-coder agreement between two annotators. Fleiss (1971) then proposed a generalization of Scott's \(\pi \) (called Fleiss' \(\kappa \)) and Davies and Fleiss (1982) one of Cohen's \(\kappa \). Although these two measures often have very similar values, there is one crucial difference between them: for calculating expected agreement for Scott's \(\pi \) and Fleiss' \(\kappa \), we only take into account the combined judgments of all coders and not the number of items assigned to each category by each individual coder. In contrast, for calculating expected agreement for Cohen's \(\kappa \), we take into account the number of times each individual coder assigns an item to a category.
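To make the computation concrete, the following is a minimal Python sketch of Fleiss' \(\kappa \) for a matrix of per-item category counts (e.g. five annotations per web page); the function name and input format are our own illustration rather than a reference to any particular library.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array where counts[i, j] is the
    number of annotators who assigned item i to category j; every row
    must sum to the same number of annotators (here: five)."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    # Observed agreement: average pairwise agreement per item.
    p_obs = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = p_obs.mean()
    # Expected agreement from the pooled category proportions; this is
    # where Fleiss' kappa differs from Cohen's kappa, which uses each
    # individual coder's own category distribution.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    P_exp = (p_cat ** 2).sum()
    return (P_bar - P_exp) / (1 - P_exp)

# Toy example: 3 web pages, 3 genre categories, 5 annotators each.
print(fleiss_kappa([[5, 0, 0], [4, 1, 0], [0, 2, 3]]))
```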

Since on MTurk the annotations are done by varying workers, Cohen's \(\kappa \) is not applicable as it requires a consistent set of annotators for all items. Therefore, like other annotation studies using crowd-sourcing (Mohammad and Turney 2012; McCreadie et al. 2011; Bentivogli et al. 2011), we calculated Fleiss' \(\kappa \) (Fleiss 1971) for the annotation. The next section presents the inter-coder agreement results.

6.4 LWGC-B: annotation study results

Overall, 42 turkers participated in annotating the corpus. The annotation task was completed within seven days at a total cost of $820. We paid 40 cents per HIT, i.e. 4 cents per page to be annotated (a HIT included 10 pages).

We achieved high reliability with a percentage agreement of 88.2 % and a Fleiss' \(\kappa \) of 0.874. Based on the interpretation of inter-coder agreement values by Landis and Koch (1977) (Table 6), this value indicates almost perfect agreement between the annotators.

Table 6 Landis and Koch interpretations (Landis and Koch 1977) of Fleiss’s kappa (Fleiss 1971)

We also computed Fleiss’s kappa for each single category in order to identify the most and the least agreed-on genre classes. To compute single category \(\kappa \) for a target category t, we merge all other categories into one \(non-t\) category and then compute agreement between t and \(non-t\). Table 7 shows the inter-coder agreement for individual genre classes. \(\kappa \) values for the individual categories illustrate substantial agreement among the coders for all categories and, as a result, annotations for all the genre classes are highly reliable. The category recipe was the easiest one for the annotators to identify whereas company/ business home pages caused the most disagreement (this genre category was mostly confused with shop).

Table 7 Inter-coder agreement for individual categories in LWGC-B shows substantial agreement among the coders. Therefore, annotations for all the genre classes are highly reliable
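As an illustration, the single-category values could be obtained by collapsing the count matrix into t versus \(non-t\) and reusing the fleiss_kappa sketch given in Sect. 6.3; again, this is only a sketch of the computation described above, not the exact implementation used for Table 7.

```python
def single_category_kappa(counts, target_column):
    """Collapse an (items x categories) count matrix into a binary
    t / non-t matrix for one target category and compute Fleiss' kappa
    (numpy and fleiss_kappa as in the earlier sketch)."""
    counts = np.asarray(counts, dtype=float)
    t = counts[:, target_column:target_column + 1]
    non_t = counts.sum(axis=1, keepdims=True) - t
    return fleiss_kappa(np.hstack([t, non_t]))
```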

The next phase of building a reliable genre-annotated dataset is to convert the annotated corpus into a gold standard. There are a number of different methods to do so (Beigman Klebanov and Beigman 2009). For instance, the annotators can discuss the items they disagreed on in order to reach agreement (Litman et al. 2006), or, if more than two annotators engage in the annotation task, a majority vote approach can be employed (Vieira and Poesio 2000). Alternatively, a domain expert can decide the final label for the disputed instances (Girju et al. 2006; Snyder and Palmer 2004), or the instances which cause disagreement can be excluded from the dataset (Beigman Klebanov and Beigman 2009).

As we employed MTurk for annotation, reaching agreement through discussion between annotators is not possible. We also decided against expert labelling as we wanted to keep the involvement of the annotation scheme developers to a minimum. Since we have five annotations per web page, the majority vote strategy was employed to assign the final label to the web pages with disagreements.

There are seven possible types of inter-annotator agreement when there are five annotators, corresponding to the seven ways in which five labels can be split across categories: (5), (4,1), (3,2), (3,1,1), (2,2,1), (2,1,1,1) and (1,1,1,1,1).Footnote 9

In order to analyze how often the annotators agreed with each other, we calculated the percentage of each type of inter-annotator agreement (Table 8). For more than 74 % of the web pages all five annotators agreed, and for 95 % of the data at least four annotators agreed on a single label, indicating a high level of agreement between the coders. The low percentages of the other five types of inter-coder agreement confirm the high value of \(\kappa \) for the annotation task. Disagreements in cases where only three annotators agreed with each other are mainly caused by confusion between news and editorial and between shop and company home page. Since we did not have a majority vote for eight web pages, the final labels for these instances were assigned by the first author of this paper.

Table 8 Distribution of different types of inter-annotator agreement in the LWGC-B
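The gold-standard labels and the agreement-pattern distribution in Table 8 could be derived with a sketch along the following lines; the input format (five labels per page) and the rule that a label chosen by at least three of the five annotators counts as a majority follow the description above, while the data structures themselves are illustrative.

```python
from collections import Counter

def aggregate(labels_per_page):
    """labels_per_page: dict mapping page id -> list of five labels.
    Returns the majority-vote gold labels, the pages without a majority
    (resolved by the first author in LWGC-B, excluded in LWGC-R), and
    the distribution of agreement patterns such as (5,), (4, 1), (3, 2)."""
    gold, unresolved, patterns = {}, [], Counter()
    for page, labels in labels_per_page.items():
        counts = Counter(labels).most_common()
        patterns[tuple(sorted((c for _, c in counts), reverse=True))] += 1
        top_label, top_count = counts[0]
        if top_count >= 3:                 # at least three of five agree
            gold[page] = top_label
        else:                              # e.g. 2-2-1 or 2-1-1-1 splits
            unresolved.append(page)
    return gold, unresolved, patterns
```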

6.5 LWGC-B: corpus statistics

In order to provide further insight into the constructed corpus, we computed some corpus statistics such as the number of tokens, the number of types and the number of sentences (see Table 9). The corpus consists of 3964 web pages distributed across 15 genres.Footnote 10 Each genre is represented by at least 184 web pages. The distribution across genres is fairly balanced, as we intended for this part of the corpus. The corpus contains more than 7 million words, which makes it approximately seven times larger than the Brown Corpus.

Table 9 The corpus statistics for LWGC-B
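As an illustration of how such per-page statistics can be computed, the following minimal sketch uses a simple regular-expression tokenizer and a naive sentence splitter; these are stand-ins for whatever tokenization was actually used, so the exact numbers in Tables 9 and 10 would differ.

```python
import re

def text_statistics(text):
    """Return token count, type count, type/token ratio and a rough
    sentence count for the body text of one web page."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens, n_types = len(tokens), len(set(tokens))
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "sentences": len(sentences),
    }
```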

Table 10 compares the genre classes in terms of text statistics. A number of interesting observations can be made from the individual categories' statistics. First, the average home page is shorter than pages in most other genre categories in this corpus. On the other hand, personal blogs and interviews contain the longest texts. A closer look at the corpus statistics also reveals that home pages tend to have a high type/token ratio compared to other categories. Recipes are substantially shorter than other types of instructions. Based on these observations, it seems that automatic genre classification algorithms could benefit from the discriminative power of these statistics as features.

Table 10 Text statistics for individual categories in the LWGC-B

In order to investigate how up-to-date our corpus is, we approximated the dates on which the web pages were published or last modified. We used the Stanford named entity recognizer (Finkel et al. 2005) to identify all dates in each page and took the most recent date as the publication or last-modified date. The results show that about 75 % of the pages were last updated in the years 2010–2012.
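We rely on the Stanford named entity recognizer for this step; purely as an illustration of the "most recent date wins" heuristic, the following simplified sketch only looks for four-digit years with a regular expression, and the year bounds are arbitrary placeholders.

```python
import re

def approximate_last_update(text, min_year=1990, max_year=2012):
    """Return the most recent plausible year mentioned in the page text,
    or None if no year is found; a crude proxy for the NER-based
    publication/last-modified date estimate described above."""
    years = [int(y) for y in re.findall(r"\b(19\d{2}|20\d{2})\b", text)]
    years = [y for y in years if min_year <= y <= max_year]
    return max(years, default=None)
```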

6.6 LWGC-B: investigating source and topic diversity

Collecting data for a genre category from topically similar sources was one of the drawbacks of some of the existing genre-annotated corpora mentioned in Sect. 3. In the construction of our corpus, we therefore tried to compile web pages from a diverse range of sources. Table 11 reports source-diversity statistics for each genre.

Table 11 Statistics for individual categories in the LWGC-B illustrating source diversity of the corpus

We can see that our focused search avoided collecting too many web pages per site, as most genre categories have a median of one web page collected per site. This is positive as it avoids associating genres with specific websites and layouts, which are subject to rapid change (although, of course, genres also change over time). However, some web sites might still be over-represented, such as the shopping web site (Amazon) from which a maximum of 23 pages was collected.
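A possible way to compute such per-genre site statistics from the page URLs is sketched below; treating the network location (host name) of a URL as the "site" is our simplifying assumption, not a claim about how the numbers in Tables 11 and 18 were produced.

```python
from collections import Counter, defaultdict
from statistics import median
from urllib.parse import urlparse

def source_diversity(pages):
    """pages: iterable of (genre, url) pairs.
    Returns, per genre, the min / median / max number of pages drawn
    from a single web site (site = network location of the URL)."""
    per_genre = defaultdict(Counter)
    for genre, url in pages:
        per_genre[genre][urlparse(url).netloc.lower()] += 1
    return {
        genre: {
            "min": min(sites.values()),
            "median": median(sites.values()),
            "max": max(sites.values()),
        }
        for genre, sites in per_genre.items()
    }
```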

As even different sources could be on the same topic, we conducted an additional investigation into the topic diversity of our corpus by extracting and comparing keywords of web pages in each genre category. The underlying assumption of this approach is that if web pages in a genre category have topically similar keywords, then that category is not represented by a sufficient variety of topics. We used the log-likelihood statistic (Dunning 1993) to identify words of a web page which have a significantly higher frequency in that page than in the whole corpus. The keyword extraction procedure consists of the following steps (Rayson and Garside 2000):

  1. We produced a word frequency list for each web page as well as for the whole corpus.

  2. For each word in the frequency list of a web page, we calculate the log-likelihood statistic by constructing the contingency table shown in Table 12, where a and b are the frequency of the word in the web page and in the whole corpus, respectively, c is the number of words in the web page and d is the number of words in the whole corpus. We can compute the log-likelihood value based on the following formula:

    $$\begin{aligned} LL=2\left( \left( a \log \left( \frac{a}{E1}\right) \right) + \left( b \log \left( \frac{b}{E2}\right) \right) \right) \end{aligned}$$
    (1)

    where \(E1=c\frac{(a+b)}{(c+d)}\) and \(E2=d\frac{(a+b)}{(c+d)}\).

  3. We then sort the word frequency list of each web page by the LL values. The words with the highest LL values are the keywords of the web page, as they occur relatively more frequently in the page than in the whole corpus (when normalized for page/corpus size).

Table 12 Contingency table for calculating log-likelihood
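The keyword extraction defined by Eq. (1) and Table 12 could be implemented along the following lines; tokenization and stop-word removal are omitted here, and the threshold of 15.13 is the standard critical chi-square value for \(p < 0.0001\) at one degree of freedom, so this is a sketch rather than the exact pipeline used for Tables 13 and 14.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Eq. (1): a, b = frequency of the word in the page and in the whole
    corpus; c, d = number of words in the page and in the whole corpus."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def page_keywords(page_tokens, corpus_counts, corpus_size,
                  threshold=15.13, top_n=10):
    """Return the strongest keywords of one page: words over-represented
    in the page whose LL value exceeds the chosen critical value."""
    page_counts = Counter(page_tokens)
    c = sum(page_counts.values())
    scored = []
    for word, a in page_counts.items():
        b = corpus_counts[word]
        if a / c <= b / corpus_size:
            continue                      # keep only over-represented words
        scored.append((word, log_likelihood(a, b, c, corpus_size)))
    keywords = [(w, ll) for w, ll in scored if ll >= threshold]
    return sorted(keywords, key=lambda x: x[1], reverse=True)[:top_n]
```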

We only considered keywords which are significant at the level of \(p < 0.0001\) and also removed some common words such as pronouns and determiners. Next, we needed to generalize from individual web pages to genre classes. To do so, we counted the number of web pages in each genre class in which a keyword appears. Table 13 shows the keywords which appear in the highest number of web pages for each genre category of our corpus; each number gives the number of documents for which the corresponding word has been selected as a keyword. Although this judgment is to a certain degree subjective, we indicate potentially spurious "topic invasion" in our corpus with italics in the table.

A qualitative analysis of the results presented in Table 13 shows very few topic-specific words. As intended, the majority of the words are genre-specific. For example, frequently asked questions are not distinguished by keywords that indicate FAQs on a specific topic but instead by general question words (such as how or what) and parts of the genre name itself. An exception is the keyword program, which might indicate several FAQs on programming languages. Similarly, blogs and forums are not distinguished by specific topics but by, for example, posting dates for blogs, and forum-specific words such as member, join, thread. Another exception is the genre category recipe, where an unavoidable correlation with the topic food holds. Even there, our corpus did not contain only recipes of a specific type, such as mostly vegetable recipes—instead the keywords indicate flexible, widely used ingredients (with the possible exception of chicken). Some potentially topic-dependent keywords such as cars, autos for editorials are not due to the corpus containing many editorials about cars but due to frequent advertising links in the boilerplate. It is also important to note that some topic-like keywords probably mirror the current distribution of web genres, such as the fact that many personal home pages are those of scientists.Footnote 11

In order to compare the topic diversity of our corpus with prior work, we also extracted keywords from comparable genre classes in existing web genre corpora. Table 14 depicts some of the results. Qualitative analysis shows that the faq category in SANTINIS (Santini 2007) is the least topically diverse category: almost all the web pages in this genre class are about hurricanes and tax. Table 14 also shows that, although keywords from categories such as blog and forum are mainly genre-specific, personal home pages in KI-04 (Meyer zu Eissen and Stein 2004) and SANTINIS seem to include too large a proportion of pages from Artificial Intelligence researchers and mathematicians, respectively (over 10 % each).

Table 13 Keywords from genre categories in LWGC-B
Table 14 Keywords from some of the genre categories of the existing web genre corpora

7 LWGC-R: human annotation study on random web pages

So far, we have described the different phases of constructing a designed web genre corpus. We chose to build a designed corpus as opposed to a random corpus because we wanted a balanced collection with a large number of web pages per genre category. The human annotation showed high inter-annotator agreement. The questions that we seek to answer in this section are twofold: firstly, can we achieve such high inter-annotator agreement on more arbitrarily selected web pages as well? Secondly, how good is the coverage of our genre inventory when it is applied to web pages that are not selected by focused search for particular genres?

In order to answer these questions, we repeated the same annotation study on a random corpus that builds on web search results. The following subsections describe the corpus collection, the corpus annotation and the results of the experiment in detail.

7.1 LWGC-R: web page collection

We use random conjunctive queries to a search engine to collect an approximation of random web pages [see Manning et al. (2008, p. 398f.) for an in-depth discussion of the difficulties of collecting a random part of the web]. The BootCat toolkit (Baroni and Bernardini 2004) offers an easy way to issue such random conjunctive queries via seed keywords.

Two things distinguish this method from a truly random web page collection (which would only be possible if we had access to a snapshot of the whole web). Firstly, if the queries are topic-specific, such as Rafael Nadal, tennis, then we will naturally get topic-specific pages back. Therefore, we need to choose very general seeds. We follow Sharoff (2006) and use a list of the 500 most frequent words extracted from the BNC corpus as seeds; these are mostly function words. BootCat creates a list of n-tuples out of the seed words by randomly combining them. We used 3-tuples in this experiment (e.g., have, we, which). These 3-tuples are used as random conjunctive queries to a search engine. Secondly, as search engines such as Google rank and retrieve web pages based not only on keyword occurrence but also on their popularity, we do not get a truly random result either, but rather a snapshot of popular web pages. In our case, this is not necessarily a disadvantage, as being able to label the most used parts of the web is important. However, there is also a genre bias when using the very top-most results, which tend to be commercial home pages (Lim et al. 2005). Therefore, we ignored the first 30 URLs retrieved for each query and collected the 20 URLs ranked from 31st to 50th position. Overall, fifty queries were sent to a search engine via BootCat, leading to the collection of 1000 URLs. After the URL collection phase, we downloaded the web pages using the KrdWrd tool (Steger and Stemle 2009).
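A minimal sketch of this query-generation and rank-filtering step is given below; the seed list shown is only a placeholder for the 500 most frequent BNC words, and the actual submission of queries to a search engine (handled by BootCat in our setup) is not included.

```python
import random

# Assumed to be loaded from the 500 most frequent BNC words (mostly
# function words); the three entries here are just placeholders.
SEED_WORDS = ["have", "we", "which"]

def random_queries(seeds, n_queries=50, tuple_size=3, rng=None):
    """Build random conjunctive queries by sampling word tuples
    without replacement from the seed list."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seeds, tuple_size)) for _ in range(n_queries)]

def select_urls(ranked_urls, skip=30, keep=20):
    """Drop the top-ranked hits (which are biased towards commercial
    home pages) and keep the URLs ranked 31st to 50th."""
    return ranked_urls[skip:skip + keep]
```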

We call this part of the corpus LWGC-R(andom). It must be noted that, even with our safeguards, the use of a search engine will still bias our corpus towards certain pages, in particular pages indexed by the search engine, more popular documents, as well as longer and more recent documents (Manning et al. 2008, p. 398f.).

7.2 LWGC-R: annotation procedure

We carried out exactly the same annotation study as for LWGC-B, using Amazon Mechanical Turk. Annotators had the option to choose one of our 15 predefined genre categories or the option other for each web page. We set the number of annotations per web page to five. Moreover, the same quality control measures as in the experiment described in Sect. 6.2.2 (e.g. trap questions, the qualification test and a high approval rate) were adopted in this experiment. The annotation cost $222.

7.3 LWGC-R: annotation results

To measure the reliability of the annotation, we calculated the inter-coder agreement measures. For this experiment, the percentage agreement is 78.15 % and \(\kappa \) is 0.712, which indicates substantial agreement between the annotators (see Table 6). We can therefore consider the annotation reliable.

We also calculated \(\kappa \) for the individual genre labels (Table 15). The \(\kappa \) value is above 0.6 for all genre labels except story and interview. Quite importantly, the agreement for the category other is high, which means that the current genres can not only be easily delimited from each other (as in LWGC-B) but also from other, arbitrary web pages. However, the \(\kappa \) value for the two genre classes story and interview is around zero, despite the fact that they have very high observed or percentage agreement (99.9 and 99.8 %, respectively). A \(\kappa \) of around zero usually indicates very poor agreement. However, this interpretation of chance-corrected agreement coefficients such as \(\pi \) and \(\kappa \) only makes sense if the categories occur reasonably often (Feinstein and Cicchetti 1990). In contrast, the two categories story and interview were hardly ever chosen, as can be seen in the fourth column of Table 15, where we indicate the number of times each category was chosen by the annotators. Due to the low number of samples of these two categories in the random corpus, we cannot draw definite conclusions with regard to their reliability.

Table 15 Inter-coder agreement for individual categories in LWGC-R shows substantial agreement among the coders

The comparison between the results of the annotation on the designed corpus LWGC-B and on the random web pages in LWGC-R reveals that the \(\kappa \) values on the more randomly selected web pages are lower. This could be due to two reasons: first, the random dataset is highly skewed; second, it is harder to obtain high inter-coder agreement for random web pages as these include more borderline or even hybrid cases. To provide more insight into this annotation study, we also computed the percentage of each type of inter-annotator agreement (Table 16). For 59.40 % of the web pages in LWGC-R all five annotators agreed, and for more than 80 % of the data at least four annotators agreed, which indicates a high level of agreement between the coders. However, when we compare Tables 16 and 8, we see that annotators find it harder to agree on the random web pages. Nevertheless, the results still show substantial agreement between the annotators, so the annotation study was successful.

Table 16 Distribution of different types of inter-annotator agreement in the LWGC-R

As for the designed corpus, we employed the majority vote strategy to assign the final label to web pages with disagreements. As shown in Table 16, there are seven possible types of inter-annotator agreement when there are five annotators. However, there is no majority for the last three types. As we did not have a majority vote for 34 web pages, we excluded them from the gold standard corpus (footnote 12).
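
A minimal sketch of this aggregation step, assuming five labels per page and a majority threshold of three votes, is given below; it illustrates the rule rather than reproducing our exact aggregation script.

from collections import Counter

def majority_label(labels, min_votes=3):
    """Return the majority label among the five annotations for a page,
    or None when no label reaches the threshold (such pages are excluded
    from the gold standard)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None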

The distribution of the genre categories in LWGC-R is very skewed (Table 17). While genres such as company home pages and news articles comprise a high percentage of the total number of web pages in LWGC-R, other genre categories such as biography and personal home pages have very few web pages assigned to them. No web page represents the genres story and interview.

Table 17 Genre distribution in the LWGC-R

7.4 LWGC-R: source and topic diversity

As noted in Sect. 3, a corpus used for automatic genre classification must be source- and topic-diverse. To achieve this for LWGC-B, we collected data from a wide range of sites. In contrast, the LWGC-R corpus was collected randomly, and it is interesting to see how topic- and source-diverse this corpus is.

We investigate the source diversity of the LWGC-R corpus by calculating the maximum, minimum and median number of websites per genre category (Table 18). The results show that web page selection via random conjunctive queries, as used for LWGC-R, collected data from a diverse range of websites. The maximum number of web pages selected from the same site is very low for all categories, with the exception of the category other, where 31 web pages were selected from a single site (Wikipedia). The frequent inclusion of Wikipedia is most likely due to the popularity bias of current search engines.
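
Statistics of this kind can be derived from the URL list alone. The sketch below illustrates one plausible way of doing so, treating the host name of each URL as the website; the pages input (a list of URL/genre pairs) is a placeholder, and the exact figures reported in Table 18 were computed with our own scripts.

from collections import Counter, defaultdict
from statistics import median
from urllib.parse import urlparse

def source_diversity(pages):
    """`pages` is a list of (url, genre) pairs. For each genre, report the
    number of distinct websites and the maximum and median number of pages
    drawn from a single site."""
    per_genre = defaultdict(Counter)
    for url, genre in pages:
        per_genre[genre][urlparse(url).netloc] += 1
    stats = {}
    for genre, site_counts in per_genre.items():
        counts = sorted(site_counts.values())
        stats[genre] = {
            "websites": len(counts),
            "max_pages_per_site": counts[-1],
            "median_pages_per_site": median(counts),
        }
    return stats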

Table 18 Statistics for individual categories in the LWGC-R illustrate source diversity of the corpus

In order to investigate the topic diversity of the LWGC-R corpus, we employed the technique described in Sect. 6.6 to extract keywords for the genre categories that comprise more than 1.5 % of the LWGC-R corpus. The results are presented in Table 19. Although the majority of the keywords are genre-specific, there are some topic-specific keywords, such as “James LeBron” in the news articles. The reason for the presence of such topical keywords could be the recency bias of the collection method via search engines, i.e. collection at a single point in time does not achieve temporal diversity. In future work, temporal diversity is therefore an additional factor that should be taken into account when compiling a random genre corpus: the collection should be performed at several points in time instead of a single one, at least for genres with a strong temporal connection such as news.
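
The exact keyword-extraction technique is the one described in Sect. 6.6. As a generic illustration of keyness extraction, the sketch below ranks the words of a genre sub-corpus against a reference corpus using a Rayson-style log-likelihood score; this is an assumption for illustration only, not a description of our actual implementation.

import math
from collections import Counter

def keywords(target_tokens, reference_tokens, top_n=20):
    """Rank words of a genre sub-corpus against a reference corpus by
    log-likelihood keyness (G2), keeping only positively key words."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(tgt.values()), sum(ref.values())   # sub-corpus / reference sizes
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        e1 = c * (a + b) / (c + d)                # expected frequencies
        e2 = d * (a + b) / (c + d)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        if a / c > b / d:
            scores[word] = g2
    return sorted(scores, key=scores.get, reverse=True)[:top_n]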

Table 19 Keywords from genre categories in LWGC-R which comprise more than 1.5 % of the corpus

7.5 LWGC-R: extending coverage

Table 17 shows that 45.34 % of the pages in LWGC-R did not belong to any of our 15 predefined genre categories, indicating coverage of somewhat more than 50 % for our 15 genres. Researchers in genre classification have come up with long lists of genre classes, e.g., 292 genre labels in the Syracuse corpus (Crowston et al. 2011) or 500 genre labels listed in Dimter (1981). The web pages categorized as other in this experiment could therefore belong to any genre class in such larger taxonomies.

New genre labels. In order to increase the coverage of genre annotation in the LWGC-R corpus, we investigated which genre classes the web pages annotated as other mainly belong to. We observed that the class other contains a considerable number of Wikipedia web pages and dictionary entries as well as directory web pages containing lists of links (footnote 13). In addition, we could easily identify two further genre categories, song lyrics and quotes.

We tried to define these five genre classes as precisely as possible (Table 20). Then, we conducted another annotation experiment on MTurk in order to investigate how reliably humans can identify these additional five genre categories. The annotation procedure was exactly the same as the one described in Sect. 6.2, but was conducted only on the 438 pages in the LWGC-R gold standard previously labelled as other.

Table 20 The definition of additional genre classes

Annotation results for new genre labels. For this experiment, the percentage agreement on the 438 pages is 79.4 % and \(\kappa \) is 0.650, which indicates substantial agreement between the annotators (see Table 6) (footnote 14). Table 21, which depicts inter-coder agreement for the five individual categories, provides a more detailed picture of how reliable each genre class is.

Table 21 Inter-coder agreement for the additional genre classes in LWGC-R

While the \(\kappa \) values for quote, lyric and dictionary are very high and the value for link lists is substantial, encyclopedic articles are not easy to identify reliably. Although Wikipedia articles were naturally identified as encyclopedic with ease, there remained confusion about the border between encyclopedic articles and other informational descriptions as well as scientific articles. Figure 2 illustrates an example web page that creates such disagreement.

Table 22 shows the number of web pages for each of these five additional genre classes where at least three out of five annotations agreed. Adding these five genre classes to LWGC-R increases the genre coverage in this corpus to 74 %. Therefore, it is possible to extend the genre annotation coverage substantially.

Table 22 Distribution of the additional genre classes in the LWGC-R
Fig. 2 An example web page which causes confusion between the classes Encyclopedia-type articles and other. http://epa.gov/climatechange/science/recentslc.html

Overall, the results show that our annotation methodology can be expanded to more genre categories, although some genre classes might not be suitable for MTurk annotation or need further clarification and refinement of their definitions. It might also help to offer contrasting genre categories when introducing related genres (such as offering scientific articles as a contrast to encyclopedic articles).

8 Conclusions and future work

In this paper, we have presented the first web genre corpus with demonstrably reliable annotation. We developed precise and consistent annotation guidelines for well-defined and well-recognized categories. For annotating the corpus, we used crowd-sourcing. This avoids several problems in prior work, such as high annotation cost and slow annotation speed. It also reduces the dependency on experts and the resulting uncertainty about the transferability of the annotation scheme to groups outside the development team.

Our corpus consists of two sub-corpora, of which one was created via focused search and the other via a more random sample of web pages returned by a search engine. Both are reliably annotated, showing that our annotation scheme is also applicable to a wide range of arbitrary web pages. Both are also stored without information loss, in HTML as well as visual format. The focused-search sub-corpus has a reasonable number of pages for each genre category, which is important for training machine learning algorithms. Both corpora are source- and topic-diverse, although the random sub-corpus has limited temporal diversity, leading to a lack of topic diversity for a single genre (news), which should be addressed in future extensions. We have also shown that our annotation approach can be extended to include further genre categories and thereby extend genre coverage. However, great care needs to be taken to offer very precise category definitions for naive annotators, and each new genre category needs to be checked for reliability.

An important future direction lies in expanding the corpus. Increasing the amount of data can be beneficial for machine learning algorithms (Banko and Brill 2001). The corpus should therefore be expanded in size, which could be done via focused search (as for LWGC-B) or by annotating random web pages (as for LWGC-R). Both approaches have advantages and disadvantages. While extending the corpus using random web pages results in an unbalanced corpus, it eliminates expert selection bias by the development team and includes less prototypical examples of genre categories. On the other hand, by employing a focused-search approach, we can create a balanced corpus and avoid the problems that a skewed corpus can create for machine learning algorithms. We therefore think that the corpus should be extended using both approaches. In addition to source and topic diversity, other variables, such as temporal diversity, should also be controlled.

Another way of extending the corpus is to increase the number of genre categories. We have shown that our original 15 genre categories cover just over half of the web pages, while our extended inventory of 20 genre categories covers about three quarters. As noted in Sect. 5, there is no universally agreed set of genre labels. However, as long as web users can identify a genre category reliably in an annotation task, it can be added to the corpus. When extending the genre categories, the issues of granularity and a potential hierarchical organisation will need to be investigated.

Another aspect of corpus extension is the creation of a multilingual genre corpus. So far, we have concentrated on English web pages only. It would be interesting to see how genres differ cross-culturally.