Experimental Study Using Annotation Experiments

Grisot, Cristina

doi:10.1007/978-3-319-96752-3_4

Part of the book series: Yearbook of Corpus Linguistics and Pragmatics (YCLP)

Abstract

This chapter, an experimental study using annotation experiments, includes four main sections and ends with a summary. First, it discusses several issues linked to using annotation data, such as reliability, validity and the measurement of inter-annotator agreement. Following the proposal made in Grisot (J Pragmat 117:245–263, 2017a), inter-annotator agreement rates, measured with the Қ coefficient, are interpreted as dependent on how accessible the annotated information is to consciousness and how available it is to conscious thought, and, as such, on its conceptual or procedural nature. Second, it advances a series of hypotheses regarding the meaning of Tense, Aktionsart and Aspect, and their predictions with respect to comprehenders’ behaviour when they have to evaluate this meaning consciously in annotation experiments. Third, it describes the annotation experiments and discusses their results. Fourth, in order to assess the role of Tense, Aktionsart and Aspect in predicting the verbal tense used in a target language, the results of a generalized mixed model suited to the data are discussed.

4.1 Dealing with Annotation Data: Inter-annotator Agreement and the Қ Coefficient

Inter-annotator agreement is widely used in corpus linguistics, computational linguistics, discourse studies and empirical pragmatics to evaluate agreement between two or more annotators when dealing with various types of linguistic information, ranging from semantic information to syntax, discourse phenomena (discourse relations, discourse connectives), figurative language and pragmatic usages of linguistic expressions, to name but a few. Inter-annotator agreement rates were needed because of scholars’ worries about the subjectivity of the judgments required to create annotated resources, which may further serve as gold-standard data (i.e. trustworthy human-annotated data) for training, testing and evaluating the performance of automatic tools. As such, the main purpose was to assure reliability, defined as the adequate ‘consistency among independent measures intended as interchangeable’ (Moss 1994, 7) and validity, defined as the ‘consonance among multiple lines of evidence supporting the intended interpretation over alternative interpretations’ (Moss 1994, 7).

As I argue in Grisot (2017a), following Krippendorff (1980), reliability has three facets: stability of the process over time; reproducibility of the process under varying circumstances, at different locations and using different annotators; and accuracy, which refers to the degree to which a process conforms to a known standard. Potter and Levine-Donnerstein (1999, 271) point out that, of these three facets, reproducibility is ‘the strongest realistic method by default’ to assess reliability. This is the case because stability is directly dependent on the annotators’ memory, while accuracy is not always achievable because, in some cases, known standards do not exist. Validity, on the other hand, may be established by a two-step process. The first step is to develop an annotation scheme which guides the annotators in the analysis of the content submitted to them for judgement. According to Poole and Folger (1981, cited by Potter & Levine-Donnerstein), annotation guidelines are ‘a translation device that allows investigators to place utterances into theoretical categories’ (Poole and Folger 1981, 477). As such, when the annotation guidelines are anchored in a theory, their validity can be assessed against theoretical predictions. The second step for establishing validity is to assess the annotators’ judgement against a known standard. As Potter & Levine-Donnerstein point out, this can be done when such a standard exists. When this is not the case, they suggest that the annotators’ intersubjective judgements (that is, judgements which are subjectively derived but shared among annotators) should be used as a standard (p. 266). For them, intersubjective judgements have the advantage that they:

give readers the sense that the patterns in the latent content^{Footnote 1} must be fairly robust and that if the readers themselves were to code the same content, they would make the same judgement.

So, Potter & Levine-Donnerstein point to five key elements which are essential for a reliable and valid study: the annotation guidelines; the theory; the standard, if it exists; the inter-subjectivity of judgments (inter-annotator agreement); and the replicability of the results.

One of the first possible measures for inter-annotator agreement rate is percentage agreement. The percentage agreement is the proportion of items on which agreement is observed, either between two judges or in the majority of opinions among several judges. Percentage agreement has one major drawback, however: it does not account for agreement due to chance. If we consider the case of two judges, the amount of agreement we would expect to occur by chance (if annotators took a decision without accounting for the annotation guidelines) depends on two conditions:

The number of categories (e.g. a binary distinction, as with mutually exclusive antonyms such as dead/alive, or a distinction with more than two categories, as with other antonyms such as beautiful/very beautiful/ugly/very ugly).
The frequency of the categories. When the categories are equally frequent, the data are normally distributed. When one category is much more frequent than the other(s), the data are not normally distributed, and are thus skewed.

Given two studies investigating the same phenomenon, the one with a smaller number of categories will have higher agreement rates simply by chance. For example, for two equally frequent categories, there is a 50% chance that, when one judge makes a decision, the second judge will make the same decision (a proportion based on the fact that there are only two choices; for four equally frequent categories, there is a 25% chance that the two judges will make the same judgment).
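The arithmetic behind these figures can be illustrated with a minimal sketch. It is not part of the original study: the function names are hypothetical, and it assumes that the two judges choose independently of each other and with the same category frequencies. Under that assumption, the chance of a matching decision is the sum of the squared category proportions.

```python
def percentage_agreement(labels_a, labels_b):
    """Proportion of items on which two judges chose the same category."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both judges must label the same non-empty set of items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def chance_agreement(category_proportions):
    """Probability that two independent judges pick the same category.

    Assumption (not from the source): each judge picks category c with
    probability p_c, and both judges share the same proportions.
    """
    if abs(sum(category_proportions) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    return sum(p * p for p in category_proportions)


# Two equally frequent categories, then four equally frequent categories.
print(chance_agreement([0.5, 0.5]))
print(chance_agreement([0.25, 0.25, 0.25, 0.25]))
# A skewed binary distinction: chance agreement rises well above 50%.
print(round(chance_agreement([0.9, 0.1]), 2))
```

The last call shows why skewed category frequencies matter: with a 90/10 split, two judges deciding at random would already agree on roughly four items out of five.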

In order to avoid the problem of agreement by chance, inter-annotator agreement can be measured with a series of chance-corrected coefficients, such as Cohen’s kappa (Cohen 1960, Carletta 1996) or Aickin’s alpha (1990). The most frequently used is Cohen’s kappa (Carletta 1996) (henceforth Қ), whose values range from 0 (signalling that there is no other agreement than that expected by chance) to 1 (signalling perfect agreement). In studies with more than two judges, several measures can be used to calculate inter-annotator agreement. One option is measuring agreement separately for each pair of judges, and reporting the average (Artstein and Poesio 2008). Another option is measuring pairwise agreement instead of percentage agreement. According to Artstein and Poesio (2008, 562), pairwise agreement for a certain item is the proportion of agreeing judgement pairs out of the total number of judgement pairs for that item; in other words, calculating the majority of labels given by the annotators for each item.
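As an illustration only, the chance correction can be sketched as follows. The sketch uses the standard definition of Cohen's kappa, (observed agreement minus expected agreement) divided by (1 minus expected agreement), where expected agreement is estimated from each judge's own category frequencies. The function names and the toy data are hypothetical, and the averaging function shows only the first multi-judge option mentioned above (one Қ per pair of judges, then the mean).

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges who labelled the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both judges must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected chance agreement from each judge's own category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both judges use a single category")
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all pairs of judges.

    `annotations` maps a judge identifier to that judge's list of labels.
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("at least two judges are required")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy data: 50 items, observed agreement 0.7, expected agreement 0.5.
judge_1 = ["past"] * 25 + ["non-past"] * 25
judge_2 = ["past"] * 20 + ["non-past"] * 5 + ["past"] * 10 + ["non-past"] * 15
print(round(cohen_kappa(judge_1, judge_2), 2))  # approximately 0.4
```

In this toy case the two judges agree on 70% of the items, but half of that agreement is expected by chance, which is why the coefficient falls to about 0.4.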

In computational and corpus linguistics, the generally accepted threshold for trustworthy data is around 0.6–0.7. However, for pragmatics and discourse studies using this method, Spooren and Degand (2010) argued that Қ values lower than this threshold are frequent. According to them, there are two possible explanations for lower Қ values in linguistic studies. The first is that language is semantically underdetermined, redundant and economical, and so the addressees must interpret it in the context. The second is the potential for coding errors, which can be: (i) errors regarding the initial working hypotheses (the annotation guidelines do not entirely capture the considered phenomenon); and (ii) errors due to individual strategies for each judge.

They suggest three methods of reducing coding errors and increasing the reliability of the data. The first is double coding, which consists of a discussion of disagreements: individual strategies become cooperative strategies, since this strategy requires making explicit the reasoning on which the judgement is based, and convincing the other annotator of the quality of the reasoning (e.g. Sanders and Spooren 2009 used double coding for their analysis of two connectives indicating causality in Dutch). The second method is one-coder-does-all, a method relying on systematic but probably subjective judgments. Spooren and Degand (2010, 254–255) explain lower Қ values with respect to the type of information encoded and its high context-dependence due to the fact that language is underdetermined. Their example is that of discourse relations, which can be marked explicitly or remain implicit. In their words,

A coherence relation like cause-consequence can be marked explicitly (using a connective like because), or it can remain implicit (no connective), in which case the coherence has to be inferred; […] This implies that establishing the coherence relation in a particular instance requires the use of contextual information, which in itself can be interpreted in multiple ways and hence is a source of disagreement.

The third is the use of descriptive statistics, such as observed and specific agreement, and a discussion of the possible reasons for disagreements. These measures should complement the interpretation of the Қ value.
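A minimal sketch of one such descriptive measure is given below. It assumes that 'specific agreement' refers to the usual per-category proportion of specific agreement, that is, twice the number of items both judges assigned to a category, divided by the total number of times either judge used that category. The source does not spell out the formula, so both this definition and the function name are assumptions.

```python
def specific_agreement(labels_a, labels_b, category):
    """Per-category agreement between two judges.

    Assumed definition: 2 * (items both judges put in `category`)
    divided by (uses of `category` by judge A + uses by judge B).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both judges must label the same items")
    both = sum(a == category and b == category for a, b in zip(labels_a, labels_b))
    uses = list(labels_a).count(category) + list(labels_b).count(category)
    if uses == 0:
        raise ValueError("neither judge used this category")
    return 2 * both / uses


# Toy data: the judges agree more often on 'past' than on 'non-past'.
judge_1 = ["past"] * 25 + ["non-past"] * 25
judge_2 = ["past"] * 20 + ["non-past"] * 5 + ["past"] * 10 + ["non-past"] * 15
print(round(specific_agreement(judge_1, judge_2, "past"), 2))
print(round(specific_agreement(judge_1, judge_2, "non-past"), 2))
```

Reporting the figure for each category separately shows where disagreements concentrate, which a single overall coefficient cannot do.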

However, when annotation experiments are used to investigate naïve (i.e. untrained) speakers’ intuitive behaviour when it comes to a linguistic or pragmatic phenomenon, the constraints mentioned above regarding annotator bias or methods of improving the value of Қ are no longer relevant. As Spooren and Degand (2010, 254) say of the one-coder-does-all strategy,

Of course the coding will be subject to individual strategies developed by the coder, but these strategies will presumably be systematic and there is no reason to assume that such strategies will be conflated with the phenomenon of interest. […] So if our research question is whether judgements^{Footnote 2} occur more often with want than with omdat, an overcoding of judgments will not impede answer to the research question.

This means that the annotator’s strategy corresponds to his/her way of understanding the phenomenon of interest. One could therefore expect inter-annotator agreement rates to be influenced by the type of information dealt with. In particular, based on Wilson & Sperber’s cognitive foundations of the conceptual/procedural distinction (1993/2012) (cf. discussion in Sects. 2.3.1 and 2.3.2), one would expect to find systematically different behaviour among native speakers when they evaluate these two types of encoded information consciously. Conceptual meaning is available to conscious thought; consequently, judging conceptual information is a rather easy task, resulting in high inter-annotator agreement rates. Procedural meaning is more difficult to evaluate consciously; consequently, procedural information is harder to judge than conceptual information, and it results in medium inter-annotator agreement rates.

4.2 Annotation Experiments with Tense and Its Description Using Reichenbachian Coordinates

4.2.1 Hypotheses and Predictions

The experiments presented in this chapter have three aims. The first is to assess whether comprehenders are able consciously to identify and categorize the configuration of Reichenbachian coordinates E, R and S and their interpretation at two levels. According to Reichenbach (1947) (cf. discussion in Sect. 1.2.1), the meanings of the target verbal tenses tested in this chapter should be described as in Table 4.1. In other words, the meaning of each verbal tense can be split into the three pairs of coordinates E/R, R/S and E/S. In this research, I make the assumption that the three pairs of coordinates do not act at the same level. The first level is the localization of eventualities E with respect to S. Two options are possible: E < S (i.e. past); and E ≥ S (i.e. non-past). At this level, in English, the Simple Past and the Past Continuous both locate eventualities in the past, and therefore have the description E < S. The Simple Present and Future locate eventualities in the non-past, and therefore have the description E ≥ S. As for French, the Passé Composé, Passé Simple and the Imparfait locate eventualities in the past, and therefore have the description E < S. As with English, the Présent and Future locate eventualities in the non-past, and therefore have the description E ≥ S.
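For readers who wish to encode this first level in an annotation scheme, the E/S descriptions listed above can be written down as a simple lookup table. This is only a sketch: the names `Localization`, `E_S_DESCRIPTION` and `localize` are hypothetical, and the table covers nothing beyond the tenses named in the paragraph above.

```python
from enum import Enum


class Localization(Enum):
    """First-level description: position of the eventuality E relative to S."""
    PAST = "E < S"
    NON_PAST = "E ≥ S"


# (language, verbal tense) -> E/S description, as stated in the text.
E_S_DESCRIPTION = {
    ("English", "Simple Past"): Localization.PAST,
    ("English", "Past Continuous"): Localization.PAST,
    ("English", "Simple Present"): Localization.NON_PAST,
    ("English", "Future"): Localization.NON_PAST,
    ("French", "Passé Composé"): Localization.PAST,
    ("French", "Passé Simple"): Localization.PAST,
    ("French", "Imparfait"): Localization.PAST,
    ("French", "Présent"): Localization.NON_PAST,
    ("French", "Future"): Localization.NON_PAST,
}


def localize(language, tense):
    """Return the E/S description for a tense, or raise KeyError if unlisted."""
    return E_S_DESCRIPTION[(language, tense)]


print(localize("French", "Imparfait").value)
print(localize("English", "Simple Present").value)
```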

Table 4.1 The meaning of verbal tenses using E, R and S (following Reichenbach 1947)

(449)	On raconte qu’un Anglais vint un jour à Genève avec l’intention de visiter le lac. Il monta dans l’une de ces vieilles voitures où l’on s’asseyait de côté comme dans les omnibus. Il a regardé le lac émerveillé.
	It is said that an Englishman come.3SG.PS one day to Geneva with the intention visiting the lake. He get in.3SG.PS in one of these old cars where you sit.3SG.IMP along the sides as on a bus. He look.3SG.PC at the lake amazed.

(450)	Il y a une heure Max boudait dans son coin, et ça n’est pas près de changer.
	An hour ago Max sulk.3SG.IMP in a corner, and this is not about to change.
	‘For an hour, Max has been sulking in a corner, and this is not about to change.’
(451)	Elle a fini par fuguer à Kaboul, où elle a été recueillie par une femme généreuse. Quelques mois plus tard, elle épousait un jeune cousin de sa bienfaitrice dont elle était tombée amoureuse.
	She finally run.3SG.PC to Kabul, where receive.3SG.PC.PSV by a kind woman. A few months later, she marry.3SG.IMP a younger cousin of her benefactor with whom she fall in love.3SG.PQP.
	‘Finally she ran to Kabul, where she was taken in by a kind woman. A few months later, she married a younger cousin of her benefactor with whom she had fallen in love.’

(452)	Le jeune soldat mis en cause a agi contre les ordres de ses supérieurs, il (être) aujourd’hui incarcéré et en attente d’être jugé pour meurtre. (Literature register)
	‘The young soldier who was accused acted against his superiors’ orders, he (to be) imprisoned today and waiting to be judged for murder.’
(453)	Marie a pris du poids. Avant de casser sa jambe, Marie (courir) tous les soirs pendant une heure. (Built example, the past condition)
	‘Mary gained weight. Before breaking her leg, Mary (to run) every evening for an hour.’
(454)	Marie s’entraîne pour le marathon. Elle (courir) tous les soirs pendant une heure.
	‘Mary trains for the marathon. She (to run) every evening for an hour.’ (Built example, the non-past condition)

(455)	De son côté, l’Eglise catholique avait organisé, en 1986, la Rencontre nationale ecclésiale cubaine (ENEC), qui - tout en rappelant que Cuba est une nation chrétienne - (prendre acte) de la société cubaine telle qu’elle était et non telle que l’Eglise l’aurait souhaitée. (Journalistic register)
	‘For its part, the Catholic Church had organized, in 1986, the Cuban National Ecclesiastic Meeting, which, while recalling that Cuba is a Christian nation, (to take cognizance of) Cuban society as it was and not as the Church would have wished it.’
(456)	Après son accident, Marie était très triste. Elle ne pouvait plus faire ce qui la rendait si heureuse. Marie (jouer) du piano. (Built example)
	‘After her accident, Mary was very sad. She could no longer do what used to make her so happy. Mary (to play) the piano.’
