1 Introduction

In approaching a collection of texts, it is natural to ask: what kinds of texts does it contain? Attempts to categorize texts by their genre go back to Aristotle (Santini et al. 2010). Detecting the genre of a text is beneficial in many areas of Natural Language Processing. In POS tagging and discourse annotation, for example, knowing the genre of a document can help in selecting appropriate language models. Giesbrecht and Evert (2009) showed the impact of genre on POS tagging performance: their POS tagger achieves 96.9 % accuracy on newspaper texts but only 85.7 % on forums. Webber (2009) showed that genres such as letters to the editor and newspaper articles differ in the distribution of discourse relations. Genre detection for web texts can also be helpful in information retrieval: Vidulin et al. (2007) make the point that it is difficult for search engine users starting from standard topical queries to find relevant pages that are also in the right genres.

Realizing this need for genre annotation, even the Brown Corpus, the very first large computerized corpus created in the 1960s, was based on a classification of texts into 15 categories roughly corresponding to genres, such as Press:Reportage, Press:Editorial, Fiction:Adventure, or Fiction:Love and Romance (Kučera and Francis 1967). The British National Corpus (BNC) contains a classification of texts according to a range of genre-related parameters, such as the type of publication (e.g., book or newspaper) and the audience (specialists or lay persons), as well as an explicit genre classification designed by Lee (2001). With the arrival of the web, it became much easier to collect large corpora. The web also gave rise to new genres not available before, such as blogs or Internet shopping sites. However, many genres which feel unique to the web have earlier precursors: for example, one could argue that (personal) blogs have similarities to published personal diaries. Section 2 will review more closely the concept of genre and the relations between web and traditional genres.

The interest in the web and its genres (Mehler et al. 2010) resulted in a proliferation of genre-annotated web corpora, each of which was built according to its own principles, classification scheme and annotation guidelines. Problematically, these corpora either were not tested for annotation reliability, as the focus of the work lay elsewhere, or exhibit low inter-annotator agreement. Rehm et al. (2008) already call for a reliably annotated web genre corpus, preferably based on a random snapshot of the web, but do not present an actual corpus. This paper takes steps to remedy this research gap. After reviewing prior web genre corpora in Sect. 3, we summarize their shortcomings: these include reliability problems, the provision of only few pages for many genre classes, and an occasional lack of source and topic diversity and of appropriate storage formats. We suggest that crowd-sourcing is the appropriate method to develop a web genre corpus with high inter-annotator reliability because it allows speedy, accurate and inexpensive genre annotation that detaches the annotation proper from the potential bias of the expert team who developed the guidelines [see also Riezler (2014) for a discussion of the potential circularity that arises if the same team develops the guidelines and annotates].

We then present the Leeds Web Genre Corpus (LWGC), which identifies 15 genre classes reliably via crowd-sourcing. Our genre inventory is detailed in Sect. 5. The LWGC consists of two sub-corpora. The first one, LWGC-B(alanced), is a designed corpus, where web pages were collected via focused search for specific genres by following links in available web directories before being submitted to crowd-sourcing annotation. This method allows us to test our annotation method on a set of web pages with little noise. In addition, it leads to a balanced distribution of genres in the corpus, which is ideal for automatic genre identification via machine learning methods that need sufficient training material for each genre—a property that many existing collections lack. We also collect the corpus from a wide variety of sources, circumventing spurious topic-genre correlations present in some prior corpora. The LWGC-B(alanced) is described in Sect. 6. Our second sub-corpus, LWGC-R(andom), then successfully extends our method to a corpus where the pages to be annotated are collected in a more arbitrary way from web pages returned by search engines. The LWGC-R(andom) corpus is described in Sect. 7. This sub-corpus also allowed us to investigate and expand the coverage of the underlying genre inventory. However, the emphasis of our paper is not on completeness of the genre inventory but on genre annotation methodology.

Our main contribution is therefore the development of a crowd-sourcing genre annotation method which leads to the first web genre corpus with all of the following properties: demonstrably high inter-annotator agreement, regardless of web page provenance, and achievable by non-expert annotators; a large number of web pages per category; source and topic diversity.

2 The concept of genre

Genre definitions. Many researchers have studied the notion of genre, mostly concentrating on the role that the form and the function of a document play in defining genre. As an example, Campbell and Jamieson (1978) defined genre as:

a group of acts unified by a constellation of forms that recurs in each of its members. These forms, in isolation, appear in other discourses. What is distinctive about the acts in genre is a recurrence of the forms together in constellation. (Campbell and Jamieson 1978, p. 20)

In this definition, the emphasis is on a document’s form. In contrast, Miller (1984, p. 159) argues that the definition of genre must not be limited to the form of the discourse only, but it should also include “the action it is used to accomplish”. In other words, texts in a genre class have the same purpose or function as well as similar patterns of form. Biber (1991) also emphasizes the importance of purpose in recognizing a genre class by stating:

I use the term genre to refer to text categorizations made on the basis of external criteria relating to author/speaker purpose.  (Biber 1991, p. 68)

Swales's (1990) definition of genre is in line with Biber's, as he also recognizes "purpose" as the principal attribute that instances of a genre class share.

We follow Orlikowski and Yates (1994) who use a more comprehensive definition of genre, which combines function and form:

a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form.  (Orlikowski and Yates 1994, p. 543)

Orlikowski and Yates’s (1994) definition also adds a new dimension by clearly stressing that genres must be socially recognizable. In other words, genre classes exist only if they are identifiable by people in society (Andersen 2008).

Web genres and traditional genres. Since this paper focuses on genres on the web, it is important to compare web genres with genres in traditional media. The World Wide Web, which was created in 1989, is a communication medium for retrieving and displaying multimedia hypertext documents (Berners-Lee et al. 1994).

Yates et al. (1997) recognized the advent of a new communications medium as one of the reasons for the emergence of variants of existing genres or of new genres. Shepherd and Watters (1998) introduced the notion of cybergenre and proposed a hierarchical taxonomy for classifying the genres on the web compared to traditional genres. According to this classification, cybergenres can be extant (i.e. "based on existing genres") or novel (i.e. "not like any existing genre in any other medium"). They give on-line newspapers as an example of extant genres and personal homepages as an example of novel genres.Footnote 1 Extant genres are divided into two sub-classes: replicated (i.e. "based on genres existing in other media") and variant (i.e. "a modification of existing genres"). Novel genres are also separated into two groups: emergent (i.e. "derived but significantly different from existing genres") and spontaneous (i.e. "never employed in other media"). They refer to personalized newspapers and frequently asked questions as examples of emergent and spontaneous genres, respectively.

Crowston and Williams (2000) proposed a similar categorization for web genres. They conducted a survey of 1000 random web pages and distinguished four different types of genres: reproduced, adapted and novel genres as well as unclassified web pages (see Table 1). Reproduced genres replicate genres in traditional media to a great extent and were found to be the most frequent type (60.6 %). The second type, adapted genres, evolved from existing genres in the paper world by using the capabilities of the new medium. For example, a list of items that uses the hyperlink capability of the web to link to other pages functions both as a list and as an index. As a third type they note novel genres exclusive to the web, such as home pages. Although the proportion of novel genres in this study is very low, it is possible that this group of genres nowadays comprises a bigger percentage due to additional frequent genres such as microblogs. Pages remained unclassified for two main reasons: the annotators did not know the name of the genre, or they found it difficult to determine the purpose of the web page. Some of these unclassified web pages could be examples of genres still in formation. Therefore, in the process of building a genre-annotated web corpus, we would expect to find some web pages without any genre label.

Table 1 Percentage of types of genres found by Crowston and Williams (2000)

3 Existing genre-annotated web corpora

Several efforts have been made to build genre-annotated web corpora and to employ them for research in the field of automatic genre identification (AGI). However, the collections differ in corpus size, collection method and web page storage format. In addition, there is no agreed set of genre labels, so each collection's labels vary according to the researchers' priorities and the genre definition chosen (see also Sect. 2). In the following, we give a short description of each genre collection, after which we summarize some characteristics all of them share. Table 2 gives an overview of the properties of these corpora.

The hierarchical genre collection (HGC) (Stubbe and Ringlstetter 2007), the Syracuse corpus (Crowston et al. 2011), KRYS I (Berninger et al. 2008) and the corpus constructed in Egbert and Biber (2013), Egbert et al. (2015) use a relatively large number of genre labels (between 32 and 292), leading to high granularity. Their focus is therefore on high coverage and the construction of a detailed taxonomy. HGC, KRYS I and Egbert et al. (2013, 2015) use a hierarchical structure of genre labels, so that a more coarse-grained classification is also available.Footnote 2 All of them use labels influenced by both form and function of the document, although some labels relate only to document function or even to document medium, especially at the coarse-grained classification level. This is especially true for Egbert et al. (2013, 2015). More details on each corpus follow.

The hierarchical genre collection (HGC) (Stubbe and Ringlstetter 2007) was annotated using hierarchical genre labels with seven main categories and thirty-two sub categories, e.g., literature as a main category with the subcategories poem, prose and drama. This collection consists of 1280 web pages preserved in HTML format. For each genre category, forty prototypical pages were manually collected.

The KRYS I collection (Berninger et al. 2008) consists of 6200 PDF documents. This corpus has been annotated using seventy genres which are grouped into ten coarse classes, e.g. Commentary and Review in the Journalism group. Although this collection is meant to be a genre-annotated web corpus, it includes only web pages in PDF format. Therefore, genres that do not normally use this format, such as homepage and shop, are not included.

The Syracuse (Crowston et al. 2011) collection consists of 3027 web pages annotated based on 292 very specific genres. The genre palette in this collection was developed bottom-up by asking three groups of people (teachers, journalists, engineers) to produce web genre terms themselves.

The corpus constructed in Egbert and Biber (2013) has 1000 random web pages categorized into eight very broad, mainly functionally defined genres or registers (e.g. description, discussion and opinion) and 56 sub-registers (which use both form and function). This corpus was annotated via Amazon Mechanical Turk, a crowd-sourcing platform. The project was later extended to 53,000 web pages in Egbert et al. (2015). Their work is therefore the most similar to ours with regard to annotation methodology. However, they have a stronger focus on coverage, whereas we focus on annotation reproducibility, which is low in their work (see Table 2 and the further discussion of reliability below).

Then there is a group of corpora with smaller sets of genre labels, either because the researchers focus less on coverage and more on genres which are of interest to them for a certain application or task (KI-04, SANTINIS), or because the authors attempt to achieve high coverage via a broad set of often purely functional labels without further subdivision (I-EN-SAMPLE, and MGC to a degree). We discuss these next.

KI-04 (Meyer zu Eissen and Stein 2004) and SANTINIS (Santini 2007) are the corpora most often used in automatic genre identification work. Their categories are motivated by web search use and web specificity, respectively. KI-04 (Meyer zu Eissen and Stein 2004) contains 1205 HTML documents annotated using eight genres, e.g., link collection, shop and articles. The genre list in this collection focuses on genre classes that are most useful for web search—it was developed by asking a group of students to fill in a questionnaire about typical topics for queries and favourite genre classes. As can be seen, the resulting classes are of quite differing granularity. The SANTINIS corpus (Santini 2007), which consists of 1400 web pages, was annotated based on seven genres. This collection focused on genres which are exclusive to the web, e.g. blog and FAQs.Footnote 3 In the compilation of this corpus, only web pages which clearly belonged to these genres were manually collected.

The MGC corpus (Vidulin et al. 2007) is the only genre-annotated corpus which allows multi-labeling, i.e. a page can be categorized into several genre classes. It consists of 1536 web pages classified into twenty genres. Some of these genres are defined on purely functional criteria, such as commercial/promotional, whereas others use both form and function (e.g. FAQ). The corpus was collected by targeting web pages in these genres, as well as by using random web pages and popular web pages from Google Zeitgeist.

I-EN-Sample (Sharoff 2010) consists of 250 web pages randomly selected from the I-EN corpus, a snapshot of English Web texts from 2005 (Sharoff 2006). It was annotated using the Functional Genre Classification (FGC) scheme, which consists of seven macro-genres aimed at describing the genre of any text. The genre palette in FGC is based solely on the function or purpose of the document, e.g. discussion, which includes academic papers, forums, emails or political debates, or instruction, which covers FAQs, manuals and tutorials. This annotation scheme therefore differs from others by sacrificing depth and specificity for coverage.

Table 2 This table summarizes some characteristics of genre-annotated corpora

We are now going to discuss areas of research where we think that the current corpora, regardless of all their diversity, leave open questions and where we can address the corresponding research gap.

Reliability. None of the existing work demonstrates high reliability of their genre annotation via inter-annotator agreement or presents a clear annotation procedure that is then proven to lead to a reliably annotated corpus.

The reasons for this differ. Corpora such as SANTINIS, KI-04 and Syracuse have been annotated by a single person. As a result, inter-annotator agreement measures cannot be computed for them. Given that SANTINIS and KI-04 explicitly searched for prototypical examples of a small set of categories, it is possible that the annotation could be recreated by several annotators, but this cannot be assured and there are no publicly available guidelines to test. The MGC, I-EN-Sample and KRYS I corpora have been double-annotated. However, agreement was low (\(\alpha =0.56\) for MGC and \(\alpha =0.55\) for I-EN-Sample), as discussed in detail in Sharoff et al. (2010).Footnote 4 Table 3 shows the low percentage agreement for the KRYS I corpus; chance-corrected agreement tends to be even lower.

Table 3 Human agreement for the KRYS I corpus (Berninger et al. 2008) which has seventy genre classes

The corpus constructed in Egbert et al. (2013, 2015) is annotated via crowd-sourcing: four annotations were assigned to each web page via the crowd-sourcing website Amazon Mechanical Turk. However, reliability results are not high: for the eight main functional genres, at least three out of four annotators agree on only 63 % of the web pages; for the fine-grained genres, at least three out of four annotators agree on only 43 % of the web pages [see the pilot study in Egbert and Biber (2013)]. In Egbert et al. (2015), chance-corrected agreement is computed at a kappa of 0.47 and 0.40 for coarse- and fine-grained categories respectively, again showing only moderate agreement.

Overall, it is interesting that granularity is an insufficient explanation for low reliability results as in many corpora (coarse-grained categories in Egbert and Biber (2013), Egbert et al. (2015), I-EN Sample, MGC) reliability is low even for a relatively small (\({<}20\)) number of categories.

Corpus design and expert annotation. There are two other issues regarding annotation in current corpora.

Firstly, many of these corpora are designed, i.e. constructed by a focused search for pages that are likely to fit a given category.Footnote 5 This is advantageous for a first test of an annotation scheme as one avoids noisy pages or borderline cases. Learning from prototypical examples can also be good for training automatic genre identification algorithms. However, it is unclear how manual or automatic results transfer to arbitrary web pages. In fact, Sharoff et al. (2010) show that human agreement tends to be even lower for arbitrary web pages than for web pages collected by focused search. A similar point is made by Rehm et al. (2008), who propose a designed corpus as a first step, with a corpus of more randomly selected web pages as a second step. Unfortunately, the authors did not follow up this suggestion with a web genre corpus of their own. In this paper, we remedy this gap.

Secondly, expert annotation can mislead with regard to the general applicability of the annotation scheme, especially if the same experts developed the scheme and conducted the annotation (Riezler 2014). This was the case in SANTINIS, MGC and KI-04, for example. To avoid this problem, we use crowd-sourcing with a larger number of naive annotators who are distinct from the scheme developers. In contrast to Egbert et al. (2013, 2015), who also use crowd-sourcing, we do not focus on coverage but on reliability, so that these efforts are complementary. To the best of our knowledge, ours is therefore the first crowd-sourcing effort for genre annotation with demonstrably high inter-annotator agreement.

Size. Many existing collections are not large enough to ensure representativeness of genre classes. Table 2 shows the maximum, minimum and median number of web pages per genre category. As can be seen, they often have few annotated web pages per category, especially the KRYS I and Syracuse corpora, while machine learning algorithms often require a reasonable number of training examples in order to produce satisfactory results. A notable exception is Egbert et al. (2015): although it also contains many genres with few or no examples, 24 of the 56 genres used are represented by over 100 pages.

Format. Another major drawback of some existing corpora is that they have been preserved in non-HTML formats such as PDF or plain text. For instance, each web page in the KRYS I corpus is saved in PDF format. As a result, automated tools are needed to convert PDF to plain text or HTML. However, these tools are error-prone, so some information may be lost or wrongly converted. In addition, previous studies in AGI show that HTML tags can improve the accuracy of genre classification (Kanaris and Stamatatos 2009) and should therefore be kept when collecting web genre corpora.

Topic diversity. There are genres which have a natural, strong correlation with certain topics; for example, the genre label recipe has a clear connection to the topic label food. These correlations between genres and topics are true and explicit connections and will always exist. However, in some existing genre-annotated corpora, there are a number of correlations between genres and topics which are spurious in that they are due to the way the search for genre texts was conducted. For example, a large sample of the frequently asked questions texts in the SANTINIS corpus (Santini 2007) comes from web sites about hurricanes. Such spurious correlations can mislead investigations into typical genre properties—Petrenz and Webber (2011), for example, show that the often best-performing bag-of-words features in AGI perform considerably worse when topic is varied. AGI based on these features therefore potentially learns topics rather than genres.

As far as we know, there is no corpus construction approach that explicitly looks into the topic diversity of the resulting corpus. We propose a method for approaching this and discuss source and topic diversity explicitly.

4 Aims of this study: creating a reliable genre-annotated corpus via crowd-sourcing

Currently, there is no established web genre annotation method that results in demonstrably high inter-annotator agreement. We try to remedy this gap by building the Leeds Web Genre Corpus (LWGC), which fulfills the following criteria:

  • It is reliably annotated for genre as measured by chance-corrected agreement. Reliability has currently been established for 15 genre classes. We also discuss extensibility of our procedure to other genre classes in Sect. 7.5.

  • It avoids circularity by crowd-sourcing naive annotators who were not involved in the development of the annotation scheme (Riezler 2014).

  • Web pages have been saved in HTML format. In addition, the appearance of each web page has been preserved by taking a screen shot of its whole content. The latter can facilitate the use of visual features as well as textual and HTML features in AGI.

  • It contains a sub-corpus (LWGC-B) that used focused search to create a corpus with a substantial number of web pages for each individual genre category. LWGC-B has been collected from a diverse range of sources in order to avoid creating false correlations between genres and topics. We discuss an approach to measure topic diversity for genre corpora.

  • It also contains a sub-corpus (LWGC-R) that approximates random web page collection in order to test (1) the transferability of the developed annotation scheme to arbitrary web pages and (2) the coverage of the current inventory of genre classes.

5 Genre inventory

The quality of manual annotations depends on the use of precise and consistent guidelines which include category definitions. Therefore, the development of the annotation guidelines must be seen as one of the crucial tasks in annotation projects. Although the main focus of this work is not the development of a comprehensive genre taxonomy, we still need clearly defined categories that our naive annotators have a chance of annotating with little training.

We used several criteria that all our genre classes needed to fulfill.

Form and function. First, we want to use only genre classes and terms that include form constraints in addition to functional constraints. This is in line with the definition we outlined in Sect. 2, and also mirrors Kessler et al. (1997), who emphasize that a genre should not be so broad that the texts belonging to it do not share any distinguishing properties.

we would probably not use the term genre to describe merely the class of texts that have the objective of persuading someone to do something, since that class (which would include editorials, sermons, prayers, advertisements, and so forth) has no distinguishing formal properties.  (Kessler et al. 1997, p. 33)

Therefore, our genre inventory automatically excludes, for example, the broad register classes used in Egbert et al. (2013, 2015). We think it is quite possible that some of the broad, functional categories in previous annotation schemes led to low inter-coder agreement—examples are categories such as informative and entertainment in the MGC corpus (Vidulin et al. 2007) or the functional genre categories in I-EN-Sample (Sharoff 2010) and Egbert and Biber (2013). Defining broad genre categories could not only cause disagreement between annotators, but also have a negative impact on automatic genre classification.

Common usage. For naive annotators, we want to use genre names which they might have heard before and that are in common use (such as forum), and to avoid expert linguistic terminology while remaining specific. This is not just a choice of convenience but also mirrors the fact that genres should be socially recognizable, as postulated by the definition we give in Sect. 2.

Text orientation. As another constraint, we were interested in textual genres only and excluded all genres that are mainly visual or include little text (such as link lists, web pages with just a video or a series of pictures etc.).

Variety of different functions. Although our genre names and descriptions include both form and function, we want to include genres that cover a broad range of functions, or what Biber (1991) calls text type dimensions. Thus, we want genres from the narrative as well as the non-narrative spectrum and from the colloquial/spontaneous as well as the edited text spectrum.

Limited set. As this was our first study on genre annotation via naive users, we decided to start with a limited genre palette instead of a complete taxonomy. We therefore made a list of all previously used genre terms that fulfilled the criteria above, mapping equivalent terms as best we could, and chose a subset of 15 genres from a wide spectrum of form and function. We also focused on genres that, based on our own informal experience, we hypothesized to be frequent on the web, such as blogs, news articles or forums.

In addition, we also tried to narrow our definitions down as much as possible while staying with socially recognizable forms: this led, for example, to the inclusion of the genre recipe as distinct from other how-to instructions. We think that this actually allows the definition of other how-tos to be more precise. In Sect. 6.5 we will show that, in accordance with our intuition, the genre recipe is indeed distinct from other instructions with regard to length and type/token distributions.

Final set. Table 4 shows the set of 15 genre labels and their definitions. We are fully aware that our set of criteria could also lead to a different set of genres; however, this set will allow us to test crowd-sourcing for a wide variety of forms and functions and includes many web-typical genres, such as homepages and forums. Other approaches can use their own genre palettes as long as they fulfil the same criteria, and can then reasonably hope that a similarly designed crowd-sourcing effort will also lead to good annotation for them.

Table 4 Definition of genre labels in the LWGC

Table 5 shows how these 15 selected genre classes correspond to those used in other genre-annotated corpora. However, since different genre-annotated corpora used different genre classes with different levels of granularity, any one-to-one comparison between our genre labels and their genre classes can only be approximate. For example, the genre label journalistic in MGC can include several genres in our corpus such as news, editorial, interviews and reviews. Another example is the periodicals (newspaper, magazine) category from the KRYS I corpus which is very broad and can include many genre classes such as recipe, interview and reviews.

Table 5 This table illustrates which genre classes in our corpus are also included in existing genre-annotated corpora

The genre inventory in Table 4 applies to both sub-corpora of the LWGC. We explore the coverage of our scheme in Sect. 7.

6 LWGC-B: a web genre corpus designed via focused search

Web corpora are categorized into designed and random corpora according to their collection method (Kilgarriff 2012). The content of a designed corpus is selected based on its design specification, normally following a focused search method. In contrast, the content of a random corpus represents a (more or less faithful) snapshot of the web. HGC (Stubbe and Ringlstetter 2007) and UKWac (Baroni et al. 2009) are examples of designed and random corpora, respectively.

As explained in Sects. 1 and 3, we use a designed corpus as the first step for testing our annotation scheme and crowd-sourcing effort, for two reasons. First, we can provide a corpus with a large number of web pages for each category via this method. While collecting random web pages is fast and cheap, there is no guarantee that it fulfills this criterion. Second, manually collected, prototypical examples provide a good test bed for using naive annotators. If agreement cannot be established on the prototypical pages, it is unlikely to be achieved on random pages. It is also possible that prototypical examples are better for training machine learners. The use of a designed corpus was also suggested by Rehm et al. (2008) as an initial step when building a reference corpus of web genres.

On the flip side, a designed corpus will not give us an accurate representation of the actual genre distribution on the web nor will it tell us the coverage of our annotation scheme. Annotation results on clear and prototypical web pages are also likely to overestimate inter-annotator agreement (Sharoff 2010). We will investigate those issues in Sect. 7 where we collate and annotate a smaller, random corpus, the LWGC-R.

6.1 LWGC-B: corpus compilation

We hand-selected web pages mainly from existing web directories, particularly the Yahoo DirectoryFootnote 6 and Open Directory ProjectFootnote 7 websites. We selected 3964 web pages from a diverse range of sources to avoid creating false correlations between topic and genre labels. We will discuss the source and topic diversity of the corpus further in Sect. 6.6.

In the next phase, we used the KrdWrd tool (Steger and Stemle 2009) to download the web pages in HTML format. However, saving a web page in HTML format alone does not guarantee the preservation of its appearance. To achieve this, we could either save all of a web page's graphics and style files or take a screen shot of its whole content. We chose the second option and used KrdWrd to also preserve each web page as an image.

6.2 LWGC-B: annotation procedure

After collection, the corpus needs to be annotated with the set of chosen genre labels (see Sect. 5), which can be a very time-consuming and expensive task. However, in recent years, the advent of crowd-sourcing (e.g. via Amazon Mechanical TurkFootnote 8) has facilitated annotation tasks so that this phase can be done more cheaply and quickly than ever before. Amazon Mechanical Turk (MTurk) has been used for a variety of labelling and annotation tasks in Natural Language Processing, e.g. word sense disambiguation, word similarity, text alignment and temporal ordering (Snow et al. 2008), machine translation (Callison-Burch 2009) and building a question answering dataset (Kaisser et al. 2008). It has also been used for genre annotation by Egbert et al. (2013, 2015), but without establishing high inter-annotator agreement (see Sect. 3).

In addition to saving expense and time, we can ensure easy re-use of the annotation scheme if even naive annotators with short guidelines achieve high reliability. The fact that the annotators are independent of scheme developers also avoids circularity in annotation (Riezler 2014).

6.2.1 Amazon’s mechanical turk

The Mechanical Turk web site provides a service which enables requesters, such as researchers or companies, to create and publish jobs, known as Human Intelligence Tasks (HITs). These HITs can be carried out by untrained MTurk workers (turkers) all around the world for a small amount of money. The main advantages of MTurk are low cost and speedy task completion, as well as its infrastructure, which allows requesters to develop their HITs using standard HTML and Javascript.

With turkers, quality control is crucial in order to detect poor-quality or randomly selected answers. Moreover, MTurk HITs, like any other web-based interface, are vulnerable to automated scripts, also known as bots, which are used by some turkers in order to maximize their income (Mason and Suri 2012). We therefore used two types of qualification criteria in our HIT design, as provided by MTurk.

Firstly, MTurk provides “system qualifications,” which are independent of the specific task created. They include HIT submission rate (the percentage of accepted HITs eventually submitted by the turker), HIT approval rate (ratio of HITs approved by the requester compared to the total number of HITs submitted by the turker), HIT rejection rate (ratio of rejected HITs compared to the total number of HITs submitted by the turker) and location (the worker’s country of residence).

The second type of quality control measure is task-specific. It includes the possibility of a pre-task qualification test designed by the requester. Up to five qualification criteria can be assigned to a HIT by the requester, and only turkers who pass these qualification measures are permitted to complete the HITs. With regard to after-task quality control, MTurk enables requesters to download and (automatically or manually) review the submitted work, then reject poor-quality data and only pay for the HITs which they approve. In the next section, we describe both the system qualifications and the task-specific pre- and after-task quality controls that we use.

6.2.2 HIT design and quality control

This section describes the details of our HIT design and quality control measures.

HIT design. Turkers were presented with the list of our 15 genre categories together with short guidelines that allowed them to view the category definitions (see Table 4). They were also able to view example pages for the categories if they wished. As our genre inventory is not exhaustive, annotators were also allowed to choose the option other for web pages that do not fit any of the 15 classes. In order to keep the annotation task simple, we decided on single-labeling, i.e. each web page could only receive a single genre label, despite the fact that some web pages might belong to more than one genre class (Crowston and Kwasnik 2004; Kessler et al. 1997; Santini 2008). Annotators needed to click on a link to open the web page to be annotated—the cached web page would then open in a separate window. Figure 1 shows a screen-shot of the annotation task.

A single HIT includes 10 web pages to be annotated, both because this is more time- and cost-effective and because we use this grouping for quality control, as described below.

Fig. 1 Screen-shot of the genre annotation task on the MTurk website

Quality control. With regard to system qualifications, we restricted the range of workers who could complete our task. As we were looking for experienced workers, we only allowed workers who had previously completed at least fifty HITs successfully. To ensure diligence, we restricted the task to workers with an approval rate of 95 % or greater.

As a task-specific pre-task qualification test, we let turkers read the definitions and examples of the genre classes and then complete a trial HIT of ten genre annotations on pages that we deemed highly prototypical and that should therefore be annotatable without much scope for error. Only turkers who completed this qualification test with a score of at least 80 % were allowed to take part. This was intended to weed out bots and random clickers.

For after-task quality control without excessive manual work or substantial expert bias, we used one of the ten web pages to be annotated per HIT as a "trap" question. We selected a set of twenty web pages that the first author of this paper judged to be unambiguous and clear examples of one of our predefined genre categories and used these web pages as trap questions. We performed semi-automated monitoring of the annotations by checking the answers to the trap questions and rejected workers who did not answer the trap questions correctly at least 80 % of the time.
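To make this procedure concrete, the following is a minimal Python sketch of such semi-automated trap-question monitoring, under the assumption that the submitted annotations can be exported as (worker, page, label) records; the page identifiers and gold labels shown are purely illustrative and not the actual trap pages used in the study.

```python
from collections import defaultdict

# Gold labels for the trap pages; ids and genres here are placeholders,
# not the actual twenty pages selected by the first author.
TRAP_GOLD = {"trap_01": "recipe", "trap_02": "faq", "trap_03": "forum"}

def flag_unreliable_workers(assignments, min_trap_accuracy=0.8):
    """assignments: iterable of (worker_id, page_id, label) records as
    exported from the crowd-sourcing platform. Returns the workers whose
    accuracy on the trap pages falls below the threshold, i.e. whose
    HITs would be rejected."""
    seen, correct = defaultdict(int), defaultdict(int)
    for worker_id, page_id, label in assignments:
        if page_id in TRAP_GOLD:              # only trap pages are checked
            seen[worker_id] += 1
            correct[worker_id] += int(label == TRAP_GOLD[page_id])
    return {w for w in seen if correct[w] / seen[w] < min_trap_accuracy}
```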

Because adding more annotators can help to reduce annotation bias, it is encouraged in human annotation projects to have as many annotators as possible (Beigman Klebanov and Beigman 2009). We chose to have five annotations per web page: Snow et al. (2008) compared the quality of annotation done by experts and MTurk workers and concluded that an average of four turkers often provides expert-level label quality.

6.3 Inter-coder agreement measures

In Natural Language Processing and machine learning, a reliably annotated dataset plays a crucial role. The results of research based on unreliable annotation can be considered untrustworthy, doubtful or even meaningless. In order to measure the reliability of an annotation, different annotators judge the same data and inter-coder agreement is calculated for their judgments. The most commonly used inter-coder agreement measures are percentage agreement, S (Bennett et al. 1954), Scott's \(\pi \) (Scott 1955), Cohen's or Fleiss' \(\kappa \) (Cohen 1960; Fleiss 1971) and Krippendorff's \(\alpha \) (Krippendorff 1970) [see Artstein and Poesio (2008) for a comprehensive survey of inter-coder agreement measures].

Percentage or observed agreement is the simplest measure of agreement among coders. However, this measure does not take into account agreement which is expected to happen by chance. As a result, it can overestimate true agreement. Therefore, other inter-coder agreement measures which correct for chance agreement must be computed. Originally, these coefficients (such as Scott's \(\pi \) and Cohen's \(\kappa \)) were proposed for calculating inter-coder agreement between two annotators. Fleiss (1971) then proposed a generalization of Scott's \(\pi \) (called Fleiss' \(\kappa \)) and Davies and Fleiss (1982) one of Cohen's \(\kappa \). Although these two measures often have very similar values, there is one crucial difference between them: for calculating expected agreement for Scott's \(\pi \) and Fleiss' \(\kappa \), we only take into account the combined judgments of all coders and not the number of items assigned to each category by each individual coder. In contrast, for calculating expected agreement for Cohen's \(\kappa \), we take into account the number of times each individual coder assigns an item to a category.
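To make the computation concrete, the following is a minimal Python sketch of Fleiss' \(\kappa \) for a matrix of per-item category counts (e.g. five annotations per web page); the function name and input format are our own illustration rather than a reference to any particular library.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array where counts[i, j] is the
    number of annotators who assigned item i to category j; every row
    must sum to the same number of annotators (here: five)."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    # Observed agreement: average pairwise agreement per item.
    p_obs = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = p_obs.mean()
    # Expected agreement from the pooled category proportions; this is
    # where Fleiss' kappa differs from Cohen's kappa, which uses each
    # individual coder's own category distribution.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    P_exp = (p_cat ** 2).sum()
    return (P_bar - P_exp) / (1 - P_exp)

# Toy example: 3 web pages, 3 genre categories, 5 annotators each.
print(fleiss_kappa([[5, 0, 0], [4, 1, 0], [0, 2, 3]]))
```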

Since on MTurk the annotations are done by varying workers, Cohen's \(\kappa \) is not applicable as it requires a consistent set of annotators for all items. Therefore, like other annotation studies using crowd-sourcing (Mohammad and Turney 2012; McCreadie et al. 2011; Bentivogli et al. 2011), we calculated Fleiss' \(\kappa \) (Fleiss 1971) for the annotation. The next section presents the inter-coder agreement results.

6.4 LWGC-B: annotation study results

Overall, 42 turkers participated in annotating the corpus. The annotation task was completed within seven days at a total cost of $820. We paid 40 cents per HIT, i.e. 4 cents per page to be annotated (a HIT included 10 pages).

We achieved high reliability with a percentage agreement of 88.2 % and a Fleiss' \(\kappa \) of 0.874. Based on the interpretation of inter-coder agreement values by Landis and Koch (1977) (Table 6), this value indicates almost perfect agreement between the annotators.

Table 6 Landis and Koch interpretations (Landis and Koch 1977) of Fleiss’s kappa (Fleiss 1971)

We also computed Fleiss’s kappa for each single category in order to identify the most and the least agreed-on genre classes. To compute single category \(\kappa \) for a target category t, we merge all other categories into one \(non-t\) category and then compute agreement between t and \(non-t\). Table 7 shows the inter-coder agreement for individual genre classes. \(\kappa \) values for the individual categories illustrate substantial agreement among the coders for all categories and, as a result, annotations for all the genre classes are highly reliable. The category recipe was the easiest one for the annotators to identify whereas company/ business home pages caused the most disagreement (this genre category was mostly confused with shop).

Table 7 Inter-coder agreement for individual categories in LWGC-B shows substantial agreement among the coders. Therefore, annotations for all the genre classes are highly reliable
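As an illustration, the single-category values could be obtained by collapsing the count matrix into t versus \(non-t\) and reusing the fleiss_kappa sketch given in Sect. 6.3; again, this is only a sketch of the computation described above, not the exact implementation used for Table 7.

```python
def single_category_kappa(counts, target_column):
    """Collapse an (items x categories) count matrix into a binary
    t / non-t matrix for one target category and compute Fleiss' kappa
    (numpy and fleiss_kappa as in the earlier sketch)."""
    counts = np.asarray(counts, dtype=float)
    t = counts[:, target_column:target_column + 1]
    non_t = counts.sum(axis=1, keepdims=True) - t
    return fleiss_kappa(np.hstack([t, non_t]))
```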

The next phase of building a reliable genre-annotated dataset is to convert the annotated corpus into a gold standard. There are a number of different methods to do so (Beigman Klebanov and Beigman 2009). For instance, the annotators can discuss the items they disagreed on in order to reach agreement (Litman et al. 2006), or, if more than two annotators engage in the annotation task, a majority vote approach can be employed (Vieira and Poesio 2000). Alternatively, a domain expert can decide the final label for the disputed instances (Girju et al. 2006; Snyder and Palmer 2004), or the instances which cause disagreement can be excluded from the dataset (Beigman Klebanov and Beigman 2009).

As we employed MTurk for annotation, reaching agreement through discussion between annotators is not possible. We also decided against expert labelling as we wanted to keep the involvement of the annotation scheme developers to a minimum. Since we have five annotations per web page, the majority vote strategy was employed to assign the final label to the web pages with disagreements.

There are seven possible types of inter-annotator agreement when there are five annotators, corresponding to the seven ways in which five labels can be split across categories: (5), (4,1), (3,2), (3,1,1), (2,2,1), (2,1,1,1) and (1,1,1,1,1).Footnote 9

In order to analyze how often the annotators agreed with each other, we calculated the percentage of each type of inter-annotator agreement (Table 8). For more than 74 % of the web pages all five annotators agreed, and for 95 % of the data at least four annotators agreed on a single label, indicating a high level of agreement between the coders. The low percentages of the other five types of inter-coder agreement confirm the high value of \(\kappa \) for the annotation task. Disagreements in cases where only three annotators agreed with each other are mainly caused by confusion between news and editorial and between shop and company home page. Since we did not have a majority vote for eight web pages, the final labels for these instances were assigned by the first author of this paper.

Table 8 Distribution of different types of inter-annotator agreement in the LWGC-B
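The gold-standard labels and the agreement-pattern distribution in Table 8 could be derived with a sketch along the following lines; the input format (five labels per page) and the rule that a label chosen by at least three of the five annotators counts as a majority follow the description above, while the data structures themselves are illustrative.

```python
from collections import Counter

def aggregate(labels_per_page):
    """labels_per_page: dict mapping page id -> list of five labels.
    Returns the majority-vote gold labels, the pages without a majority
    (resolved by the first author in LWGC-B, excluded in LWGC-R), and
    the distribution of agreement patterns such as (5,), (4, 1), (3, 2)."""
    gold, unresolved, patterns = {}, [], Counter()
    for page, labels in labels_per_page.items():
        counts = Counter(labels).most_common()
        patterns[tuple(sorted((c for _, c in counts), reverse=True))] += 1
        top_label, top_count = counts[0]
        if top_count >= 3:                 # at least three of five agree
            gold[page] = top_label
        else:                              # e.g. 2-2-1 or 2-1-1-1 splits
            unresolved.append(page)
    return gold, unresolved, patterns
```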

6.5 LWGC-B: corpus statistics

In order to provide further insight into the constructed corpus, we computed some corpus statistics such as the number of tokens, the number of types and the number of sentences (see Table 9). The corpus consists of 3964 web pages distributed across 15 genres.Footnote 10 Each genre is represented by at least 184 web pages. The distribution across genres is fairly balanced, as we intended for this part of the corpus. The corpus contains more than 7 million words, which makes it approximately seven times larger than the Brown Corpus.

Table 9 The corpus statistics for LWGC-B
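As an illustration of how such per-page statistics can be computed, the following minimal sketch uses a simple regular-expression tokenizer and a naive sentence splitter; these are stand-ins for whatever tokenization was actually used, so the exact numbers in Tables 9 and 10 would differ.

```python
import re

def text_statistics(text):
    """Return token count, type count, type/token ratio and a rough
    sentence count for the body text of one web page."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens, n_types = len(tokens), len(set(tokens))
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "sentences": len(sentences),
    }
```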

Table 10 compares the genre classes in terms of text statistics. A number of interesting observations can be made from the individual categories' statistics. First, the average home page is shorter than pages in most other genre categories in this corpus. On the other hand, personal blogs and interviews contain the longest texts. A closer look at the corpus statistics also reveals that home pages tend to have a high type/token ratio compared to other categories. Recipes are substantially shorter than other types of instructions. Based on these observations, it seems that automatic genre classification algorithms could benefit from the discriminative power of these statistics as features.

Table 10 Text statistics for individual categories in the LWGC-B

In order to investigate how up-to-date our corpus is, we approximated the dates on which the web pages were published or last modified. We used the Stanford named entity recognizer (Finkel et al. 2005) to identify all dates in each page and took the most recent date as the publication or last-modified date. The results show that about 75 % of the pages were last updated in the years 2010–2012.
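We rely on the Stanford named entity recognizer for this step; purely as an illustration of the "most recent date wins" heuristic, the following simplified sketch only looks for four-digit years with a regular expression, and the year bounds are arbitrary placeholders.

```python
import re

def approximate_last_update(text, min_year=1990, max_year=2012):
    """Return the most recent plausible year mentioned in the page text,
    or None if no year is found; a crude proxy for the NER-based
    publication/last-modified date estimate described above."""
    years = [int(y) for y in re.findall(r"\b(19\d{2}|20\d{2})\b", text)]
    years = [y for y in years if min_year <= y <= max_year]
    return max(years, default=None)
```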

6.6 LWGC-B: investigating source and topic diversity

Collecting data for a genre category from topically similar sources was one of the drawbacks of some of the existing genre-annotated corpora mentioned in Sect. 3. In the construction of our corpus, we therefore tried to compile web pages from a diverse range of sources. Table 11 reports source-diversity statistics for each genre.

Table 11 Statistics for individual categories in the LWGC-B illustrating source diversity of the corpus

We can see that our focused search avoided collecting too many web pages per site, as most genre categories have a median of one web page collected per site. This is positive as it avoids associating genres with specific websites and layouts, which are subject to rapid change (although, of course, genres also change over time). However, some web sites might still be over-represented, such as the shopping web site (Amazon) from which a maximum of 23 pages was collected.
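A possible way to compute such per-genre site statistics from the page URLs is sketched below; treating the network location (host name) of a URL as the "site" is our simplifying assumption, not a claim about how the numbers in Tables 11 and 18 were produced.

```python
from collections import Counter, defaultdict
from statistics import median
from urllib.parse import urlparse

def source_diversity(pages):
    """pages: iterable of (genre, url) pairs.
    Returns, per genre, the min / median / max number of pages drawn
    from a single web site (site = network location of the URL)."""
    per_genre = defaultdict(Counter)
    for genre, url in pages:
        per_genre[genre][urlparse(url).netloc.lower()] += 1
    return {
        genre: {
            "min": min(sites.values()),
            "median": median(sites.values()),
            "max": max(sites.values()),
        }
        for genre, sites in per_genre.items()
    }
```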

As even different sources could be on the same topic, we conducted an additional investigation into the topic diversity of our corpus by extracting and comparing keywords of web pages in each genre category. The underlying assumption of this approach is that if web pages in a genre category have topically similar keywords, then that category is not represented by a sufficient variety of topics. We used the log-likelihood statistic (Dunning 1993) to identify words of a web page which have a significantly higher frequency in that page than in the whole corpus. The keyword extraction procedure consists of the following steps (Rayson and Garside 2000):

  1. We produced a word frequency list for each web page as well as for the whole corpus.

  2. For each word in the frequency list of a web page, we calculate the log-likelihood statistic by constructing the contingency table shown in Table 12, where a and b are the frequency of the word in the web page and in the whole corpus, respectively, c is the number of words in the web page and d is the number of words in the whole corpus. We can compute the log-likelihood value based on the following formula:

    $$\begin{aligned} LL=2\left( \left( a \log \left( \frac{a}{E1}\right) \right) + \left( b \log \left( \frac{b}{E2}\right) \right) \right) \end{aligned}$$
    (1)

    where \(E1=c\frac{(a+b)}{(c+d)}\) and \(E2=d\frac{(a+b)}{(c+d)}\).

  3. We then sort the word frequency list of each web page by the LL values. The words with the highest LL values are the keywords of the web page, as they occur relatively more frequently in the page than in the whole corpus (when normalized for page/corpus size).

Table 12 Contingency table for calculating log-likelihood
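The keyword extraction defined by Eq. (1) and Table 12 could be implemented along the following lines; tokenization and stop-word removal are omitted here, and the threshold of 15.13 is the standard critical chi-square value for \(p < 0.0001\) at one degree of freedom, so this is a sketch rather than the exact pipeline used for Tables 13 and 14.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Eq. (1): a, b = frequency of the word in the page and in the whole
    corpus; c, d = number of words in the page and in the whole corpus."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def page_keywords(page_tokens, corpus_counts, corpus_size,
                  threshold=15.13, top_n=10):
    """Return the strongest keywords of one page: words over-represented
    in the page whose LL value exceeds the chosen critical value."""
    page_counts = Counter(page_tokens)
    c = sum(page_counts.values())
    scored = []
    for word, a in page_counts.items():
        b = corpus_counts[word]
        if a / c <= b / corpus_size:
            continue                      # keep only over-represented words
        scored.append((word, log_likelihood(a, b, c, corpus_size)))
    keywords = [(w, ll) for w, ll in scored if ll >= threshold]
    return sorted(keywords, key=lambda x: x[1], reverse=True)[:top_n]
```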

We only considered keywords which are significant at the level of \(p < 0.0001\) and also removed some common words such as pronouns and determiners. Next, we needed to generalize from individual web pages to genre classes. To do so, we counted the number of web pages in each genre class in which a keyword appears. Table 13 shows the keywords which appear in the highest number of web pages for each genre category of our corpus; each number gives the number of documents for which the corresponding word has been selected as a keyword. Although this judgment is to a certain degree subjective, we indicate potentially spurious "topic invasion" in our corpus with italics in the table.

A qualitative analysis of the results presented in Table 13 shows very few topic-specific words. As intended, the majority of the words are genre-specific. For example, frequently asked questions are not distinguished by keywords that indicate FAQs on a specific topic but instead by general question words (such as how or what) and parts of the genre name itself. An exception is the keyword program, which might indicate several FAQs on programming languages. Similarly, blogs and forums are not distinguished by specific topics but by, for example, posting dates for blogs, and forum-specific words such as member, join, thread. Another exception is the genre category recipe, where an unavoidable correlation with the topic food holds. Even there, our corpus did not contain only recipes of a specific type, such as mostly vegetable recipes—instead the keywords indicate flexible, widely used ingredients (with the possible exception of chicken). Some potentially topic-dependent keywords such as cars, autos for editorials are not due to the corpus containing many editorials about cars but due to frequent advertising links in the boilerplate. It is also important to note that some topic-like keywords probably mirror the current distribution of web genres, such as the fact that many personal home pages are those of scientists.Footnote 11

In order to compare the topic diversity of our corpus with prior work, we also extracted keywords from comparable genre classes in existing web genre corpora. Table 14 depicts some of the results. Qualitative analysis shows that the faq category in SANTINIS (Santini 2007) is the least topically diverse category: almost all the web pages in this genre class are about hurricanes and tax. Table 14 also shows that, although keywords from categories such as blog and forum are mainly genre-specific, personal home pages in KI-04 (Meyer zu Eissen and Stein 2004) and SANTINIS seem to include too large a proportion of pages from Artificial Intelligence researchers and mathematicians, respectively (over 10 % each).

Table 13 Keywords from genre categories in LWGC-B
Table 14 Keywords from some of the genre categories of the existing web genre corpora

7 LWGC-R: human annotation study on random web pages

So far, we have described the different phases of constructing a designed web genre corpus. We chose to build a designed corpus as opposed to a random corpus because we wanted a balanced collection with a large number of web pages per genre category. The human annotation showed high inter-annotator agreement. The questions that we seek to answer in this section are twofold: firstly, can we achieve such high inter-annotator agreement on more arbitrarily selected web pages as well? Secondly, how good is the coverage of our genre inventory when it is applied to web pages that are not selected by focused search for particular genres?

In order to answer these questions, we repeated the same annotation study on a random corpus that builds on web search results. The following subsections describe the corpus collection, the corpus annotation and the results of the experiment in detail.

7.1 LWGC-R: web page collection

We use random conjunctive queries to a search engine to collect an approximation of random web pages [see Manning et al. (2008, p. 398f.) for an in-depth discussion of the difficulties of collecting a random part of the web]. The BootCat toolkit (Baroni and Bernardini 2004) offers an easy way to issue such random conjunctive queries via seed keywords.

Two things distinguish this method from a truly random web page collection (which would only be possible if we had access to a snapshot of the whole web). Firstly, if the queries are topic-specific, such as Rafael Nadal, tennis, then we will naturally get topic-specific pages back. Therefore, we need to choose very general seeds. We follow Sharoff (2006) and use a list of the 500 most frequent words extracted from the BNC corpus as seeds; these are mostly function words. BootCat creates a list of n-tuples out of the seed words by randomly combining them. We used 3-tuples in this experiment (e.g., have, we, which). These 3-tuples are used as random conjunctive queries to a search engine. Secondly, as search engines such as Google rank and retrieve web pages based not only on keyword occurrence but also on their popularity, we do not get a truly random result either, but rather a snapshot of popular web pages. In our case, this is not necessarily a disadvantage, as being able to label the most used parts of the web is important. However, there is also a genre bias when using the very top-most results, which tend to be commercial home pages (Lim et al. 2005). Therefore, we ignored the first 30 URLs retrieved for each query and collected the 20 URLs ranked from 31st to 50th position. Overall, fifty queries were sent to a search engine via BootCat, leading to the collection of 1000 URLs. After the URL collection phase, we downloaded the web pages using the KrdWrd tool (Steger and Stemle 2009).
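A minimal sketch of this query-generation and rank-filtering step is given below; the seed list shown is only a placeholder for the 500 most frequent BNC words, and the actual submission of queries to a search engine (handled by BootCat in our setup) is not included.

```python
import random

# Assumed to be loaded from the 500 most frequent BNC words (mostly
# function words); the three entries here are just placeholders.
SEED_WORDS = ["have", "we", "which"]

def random_queries(seeds, n_queries=50, tuple_size=3, rng=None):
    """Build random conjunctive queries by sampling word tuples
    without replacement from the seed list."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seeds, tuple_size)) for _ in range(n_queries)]

def select_urls(ranked_urls, skip=30, keep=20):
    """Drop the top-ranked hits (which are biased towards commercial
    home pages) and keep the URLs ranked 31st to 50th."""
    return ranked_urls[skip:skip + keep]
```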

We call this part of the corpus LWGC-R(andom). It must be noted that, even with our safeguards, the use of a search engine will still bias our corpus towards certain pages, in particular pages indexed by the search engine, more popular documents, as well as longer and more recent documents (Manning et al. 2008, p. 398f.).

7.2 LWGC-R: annotation procedure

We carried out exactly the same annotation study as for LWGC-B, using Amazon Mechanical Turk. Annotators had the option to choose one of our 15 predefined genre categories or the option other for each web page. We set the number of annotations per web page to five. Moreover, the same quality control measures as in the experiment described in Sect. 6.2.2 (e.g. trap questions, the qualification test and a high approval rate) were adopted in this experiment. The annotation cost $222.

7.3 LWGC-R: annotation results

To measure the reliability of the annotation, we calculated the inter-coder agreement measures. For this experiment, the percentage agreement is 78.15 % and \(\kappa \) is 0.712, which indicates substantial agreement between the annotators (see Table 6). We can therefore consider the annotation reliable.

We also calculated \(\kappa \) for the individual genre labels (Table 15). The \(\kappa \) value is above 0.6 for all genre labels except story and interview. Quite importantly, the agreement for the category other is high, which means that the current genres can not only be easily delimited from each other (as in LWGC-B) but also from other, arbitrary web pages. However, the \(\kappa \) value for the two genre classes story and interview is around zero, despite the fact that they have very high observed or percentage agreement (99.9 and 99.8 %, respectively). A \(\kappa \) of around zero usually indicates very poor agreement. However, this interpretation of chance-corrected agreement coefficients such as \(\pi \) and \(\kappa \) only makes sense if the categories occur reasonably often (Feinstein and Cicchetti 1990). In contrast, the two categories story and interview were hardly ever chosen, as can be seen in the fourth column of Table 15, where we indicate the number of times each category was chosen by the annotators. Due to the low number of samples of these two categories in the random corpus, we cannot draw definite conclusions with regard to their reliability.

Table 15 Inter-coder agreement for individual categories in LWGC-R shows substantial agreement among the coders

The comparison between the results of the annotation on the designed corpus LWGC-B and on the random web pages in LWGC-R reveals that the \(\kappa \) values on the more randomly selected web pages are lower. This could be due to two reasons: first, the random dataset is highly skewed; second, it is harder to obtain high inter-coder agreement for random web pages as these include more borderline or even hybrid cases. To provide more insight into this annotation study, we also computed the percentage of each type of inter-annotator agreement (Table 16). For 59.40 % of the web pages in LWGC-R all five annotators agreed, and for more than 80 % of the data at least four annotators agreed, which indicates a high level of agreement between the coders. However, when we compare Tables 16 and 8, we see that annotators find it harder to agree on the random web pages. Nevertheless, the results still show substantial agreement between the annotators, so the annotation study was successful.

Table 16 Distribution of different types of inter-annotator agreement in the LWGC-R

As for the designed corpus, we employed the majority vote strategy to assign the final label to web pages with disagreements. As shown in Table 16, there are seven possible types of inter-annotator agreement when there are five annotators. However, there is no majority for the last three types. As we did not have a majority vote for 34 web pages, we excluded them from the gold standard corpus (footnote 12).
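
A minimal sketch of this aggregation step, assuming five labels per page and a majority threshold of three votes, is given below; it illustrates the rule rather than reproducing our exact aggregation script.

from collections import Counter

def majority_label(labels, min_votes=3):
    """Return the majority label among the five annotations for a page,
    or None when no label reaches the threshold (such pages are excluded
    from the gold standard)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None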

The distribution of the genre categories in LWGC-R is very skewed (Table 17). While genres such as company home pages and news articles comprise a high percentage of the total number of web pages in LWGC-R, other genre categories such as biography and personal home pages have very few web pages assigned to them. No web page represents the genres story and interview.

Table 17 Genre distribution in the LWGC-R

7.4 LWGC-R: source and topic diversity

As noted in Sect. 3, a corpus used for automatic genre classification must be source- and topic-diverse. To achieve this for LWGC-B, we collected data from a wide range of sites. In contrast, the LWGC-R corpus was collected randomly, and it is interesting to see how topic- and source-diverse this corpus is.

We investigate the source diversity of the LWGC-R corpus by calculating the maximum, minimum and median number of websites per genre category (Table 18). The results show that web page selection via random conjunctive queries, as used for LWGC-R, collected data from a diverse range of websites. The maximum number of web pages selected from the same site is very low for all categories, with the exception of the category other, where 31 web pages were selected from a single site (Wikipedia). The frequent inclusion of Wikipedia is most likely due to the popularity bias of current search engines.
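
Statistics of this kind can be derived from the URL list alone. The sketch below illustrates one plausible way of doing so, treating the host name of each URL as the website; the pages input (a list of URL/genre pairs) is a placeholder, and the exact figures reported in Table 18 were computed with our own scripts.

from collections import Counter, defaultdict
from statistics import median
from urllib.parse import urlparse

def source_diversity(pages):
    """`pages` is a list of (url, genre) pairs. For each genre, report the
    number of distinct websites and the maximum and median number of pages
    drawn from a single site."""
    per_genre = defaultdict(Counter)
    for url, genre in pages:
        per_genre[genre][urlparse(url).netloc] += 1
    stats = {}
    for genre, site_counts in per_genre.items():
        counts = sorted(site_counts.values())
        stats[genre] = {
            "websites": len(counts),
            "max_pages_per_site": counts[-1],
            "median_pages_per_site": median(counts),
        }
    return stats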

Table 18 Statistics for individual categories in the LWGC-R illustrate source diversity of the corpus

In order to investigate the topic diversity of the LWGC-R corpus, we employed the technique described in Sect. 6.6 to extract keywords for the genre categories that comprise more than 1.5 % of the LWGC-R corpus. The results are presented in Table 19. Although the majority of the keywords are genre-specific, there are some topic-specific keywords, such as “James LeBron” in the news articles. The reason for the presence of such topical keywords could be the recency bias of the collection method via search engines, i.e. collection at a single point in time does not achieve temporal diversity. In future work, temporal diversity is therefore an additional factor that should be taken into account when compiling a random genre corpus: the collection should be performed at several points in time instead of a single one, at least for genres with a strong temporal connection such as news.
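
The exact keyword-extraction technique is the one described in Sect. 6.6. As a generic illustration of keyness extraction, the sketch below ranks the words of a genre sub-corpus against a reference corpus using a Rayson-style log-likelihood score; this is an assumption for illustration only, not a description of our actual implementation.

import math
from collections import Counter

def keywords(target_tokens, reference_tokens, top_n=20):
    """Rank words of a genre sub-corpus against a reference corpus by
    log-likelihood keyness (G2), keeping only positively key words."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(tgt.values()), sum(ref.values())   # sub-corpus / reference sizes
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        e1 = c * (a + b) / (c + d)                # expected frequencies
        e2 = d * (a + b) / (c + d)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        if a / c > b / d:
            scores[word] = g2
    return sorted(scores, key=scores.get, reverse=True)[:top_n]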

Table 19 Keywords from genre categories in LWGC-R which comprise more than 1.5 % of the corpus

7.5 LWGC-R: extending coverage

Table 17 shows that 45.34 % of the pages in LWGC-R did not belong to any of our 15 predefined genre categories, indicating coverage of somewhat more than 50 % for our 15 genres. Researchers in genre classification have come up with long lists of genre classes, e.g., 292 genre labels in the Syracuse corpus (Crowston et al. 2011) or 500 genre labels listed in Dimter (1981). The web pages categorized as other in this experiment could therefore belong to any genre class in such larger taxonomies.

New genre labels. In order to increase the coverage of genre annotation in the LWGC-R corpus, we investigated which genre classes the web pages annotated as other mainly belong to. We observed that the class other contains a considerable number of Wikipedia web pages and dictionary entries as well as directory web pages containing lists of links (footnote 13). In addition, we could easily identify two further genre categories, song lyrics and quotes.

We tried to define these five genre classes as precisely as possible (Table 20). Then, we conducted another annotation experiment on MTurk in order to investigate how reliably humans can identify these additional five genre categories. The annotation procedure was exactly the same as the one described in Sect. 6.2, but was conducted only on the 438 pages in the LWGC-R gold standard previously labelled as other.

Table 20 The definition of additional genre classes

Annotation results for new genre labels. For this experiment, the percentage agreement on the 438 pages is 79.4 % and \(\kappa \) is 0.650, which indicates substantial agreement between the annotators (see Table 6) (footnote 14). Table 21, which depicts inter-coder agreement for the five individual categories, provides a more detailed picture of how reliable each genre class is.

Table 21 Inter-coder agreement for the additional genre classes in LWGC-R

While the \(\kappa \) values for quote, lyric and dictionary are very high and the value for link lists is substantial, encyclopedic articles are not easy to identify reliably. Although Wikipedia articles were naturally identified as encyclopedic with ease, there remained confusion about the border between encyclopedic articles and other informational descriptions as well as scientific articles. Figure 2 illustrates an example web page that creates such disagreement.

Table 22 shows the number of web pages for each of these five additional genre classes where at least three out of five annotations agreed. Adding these five genre classes to LWGC-R increases the genre coverage in this corpus to 74 %. Therefore, it is possible to extend the genre annotation coverage substantially.

Table 22 Distribution of the additional genre classes in the LWGC-R
Fig. 2 An example web page which causes confusion between the classes Encyclopedia-type articles and other. http://epa.gov/climatechange/science/recentslc.html

Overall, the results show that our annotation methodology can be expanded to more genre categories, although some genre classes might not be suitable for MTurk annotation or need further clarification and refinement of their definitions. It might also help to offer contrasting genre categories when introducing related genres (such as offering scientific articles as a contrast to encyclopedic articles).

8 Conclusions and future work

In this paper, we have presented the first web genre corpus with demonstrably reliable annotation. We developed precise and consistent annotation guidelines for well-defined and well-recognized categories. For annotating the corpus, we used crowd-sourcing. This avoids several problems in prior work, such as high annotation cost and slow annotation speed. It also reduces the dependency on experts and the resulting uncertainty about the transferability of the annotation scheme to groups outside the development team.

Our corpus consists of two sub-corpora, of which one was created via focused search and the other via a more random sample of web pages returned by a search engine. Both are reliably annotated, showing that our annotation scheme is also applicable to a wide range of arbitrary web pages. Both are also stored without information loss, in HTML as well as visual format. The focused-search sub-corpus has a reasonable number of pages for each genre category, which is important for training machine learning algorithms. Both corpora are source- and topic-diverse, although the random sub-corpus has limited temporal diversity, leading to a lack of topic diversity for a single genre (news), which should be addressed in future extensions. We have also shown that our annotation approach can be extended to include further genre categories and thereby extend genre coverage. However, great care needs to be taken to offer very precise category definitions for naive annotators, and each new genre category needs to be checked for reliability.

An important future direction lies in expanding the corpus. Increasing the amount of data can be beneficial for machine learning algorithms (Banko and Brill 2001). The corpus should therefore be expanded in size, which could be done via focused search (as for LWGC-B) or by annotating random web pages (as for LWGC-R). Both approaches have advantages and disadvantages. While extending the corpus using random web pages results in an unbalanced corpus, it eliminates expert selection bias by the development team and includes less prototypical examples of genre categories. On the other hand, by employing a focused-search approach, we can create a balanced corpus and avoid the problems that a skewed corpus can create for machine learning algorithms. We therefore think that the corpus should be extended using both approaches. In addition to source and topic diversity, other variables, such as temporal diversity, should also be controlled.

Another way of extending the corpus is to increase the number of genre categories. We have shown that our original 15 genre categories cover just over half of the web pages, while our extended inventory of 20 genre categories covers about three quarters. As noted in Sect. 5, there is no universally agreed set of genre labels. However, as long as web users can identify a genre category reliably in an annotation task, it can be added to the corpus. When extending the genre categories, the issues of granularity and a potential hierarchical organisation will need to be investigated.

Another aspect of corpus extension is the creation of a multilingual genre corpus. So far, we have concentrated on English web pages only. It would be interesting to see how genres differ cross-culturally.