
1 Introduction

Computational argumentation is an interdisciplinary research field that combines natural language processing with other disciplines such as artificial intelligence. A central question in computational argumentation is: What makes an argument good or bad? Depending on the goal of the author of a text, argument quality can involve a variety of dimensions. Evaluating the quality of an argument across these diverse dimensions demands a deep understanding of the topic at hand, often coupled with expertise from the argumentation literature. Hence, manual assessment of argument quality is a challenging and time-consuming process.

A promising technology for streamlining argument quality assessment is large language models (LLMs), which have demonstrated impressive capabilities in tasks that require a profound understanding of semantic nuances and discourse structures. LLMs have been effectively employed in tasks such as summarization [32], question answering [16], and relation extraction [31]. Previous research has also investigated the usefulness of LLMs in argument mining tasks such as argument component identification [12], evidence detection [15], and stance classification [4]. Moreover, an emerging trend highlights the adoption of LLMs for data annotation purposes, such as sentiment analysis [8, 24], relevance judgement [11], and harm measurement [18]. To the best of our knowledge, no prior work has investigated the potential of LLMs as annotators of argument quality.

In this paper, we analyze the reliability of LLMs as argument quality annotators by comparing automatic quality judgements with human annotations from both experts and novices. We compare these quality ratings not only at an aggregate level, but also examine the individual components that make up argument quality. This includes looking at how well models can judge the relevance and coherence of an argument, the sufficiency of its evidential support, and the effectiveness of its rhetorical appeal. Ultimately, our objective is to understand whether LLMs can serve as a practical and reliable tool that supports and enhances human-led efforts in argument quality assessment.

Specifically, we ask the following research questions regarding the potential of employing LLMs as argument quality annotators:

  1. RQ1: Do LLMs provide more consistent evaluations of argument quality compared to human annotators?

  2. RQ2: Do the assessments of argument quality made by LLMs align with those made by either human experts or human novices?

  3. RQ3: Can integrating LLM annotations with human annotations significantly improve the resulting agreement in argument quality ratings?

In the following, Sect. 2 reviews the related work, Sect. 3 describes the experimental setup, including the dataset, the annotation procedure, and the employed models, and Sect. 4 presents the results of these experiments.

2 Related Work

We first review existing literature on the evaluation and annotation of argument quality. Following that, we explore work that has examined the capabilities of large language models (LLMs) as data annotators, as well as the degree of alignment between LLMs and human annotators.

2.1 Evaluating Argument Quality

Collecting argument quality annotations is an intricate task that often requires domain-specific knowledge, multiple annotators, and consistent annotator reliability. Numerous works have studied argumentation quality across different domains, employing multiple annotators to classify and evaluate arguments based on various quality criteria. Park and Cardie [22] studied argumentation quality in the domain of web discourse. They employed two annotators to classify 9,000 web-based propositions into four categories based on their level of verifiability. Habernal and Gurevych [13] let five crowd-workers annotate a dataset consisting of 16,000 pairs of arguments with a binary “is more convincing” label, providing explanations for their decisions. Toledo et al. [27] collected a dataset of 14,000 pairs of arguments, each annotated with relative argument quality scores ranging from 0 to 1. They employed between 15 and 17 annotators for each instance to enhance the reliability of the collected annotations.

In the domain of student essays, Persing and Ng [23] instructed six human annotators to evaluate 1,000 essays based on the strength of argumentation on a scale from 1 to 4. Carlile et al. [3] considered persuasiveness as the most important quality dimension of an argumentative essay. They asked two native English speakers to annotate 102 essays with argument components, argument persuasiveness scores, and further attributes that determine the persuasiveness of an argument, such as specificity, evidence, eloquence, relevance, and strength. Moreover, Marro et al. [19] employed three expert annotators to annotate the essay components of Stab and Gurevych [25] with three basic argument quality dimensions: cogency, rhetoric, and reasonableness.

Aiming to create a unified understanding of argument quality properties, Wachsmuth et al. [30] proposed a comprehensive taxonomy of 15 argument quality dimensions derived from the argumentation literature. Three expert annotators were employed to annotate 320 arguments [13]. In Sect. 3, we use their quality annotations from 1 (low) to 3 (high) as a reference for our experiments.

Despite the multiple attempts and methodologies to evaluate argument quality, the process remains labor-intensive, time-consuming, and requires a significant degree of expertise. To facilitate the task of argument quality annotation, we propose employing LLMs, as they can potentially provide more reliable and consistent annotations while significantly reducing the required manual effort.

2.2 LLMs as Annotators

Recent work has expanded the role of LLMs beyond language generation and explored their potential as data annotators. Ding et al. [8] assessed the performance of GPT-3 [2] as a data annotator for sentiment analysis, relation extraction, named entity recognition, and aspect sentiment triplet extraction. They compared the efficiency of BERT [7] trained on data annotated by GPT-3 against BERT trained on human-annotated data. Their findings showed comparable performance at substantially reduced annotation costs, suggesting that GPT-3 is a potentially cost-effective alternative for data annotation. A study by Gilardi et al. [11] cross-examined the annotations by ChatGPT [20] and those by crowd-workers against expert annotations across four tasks: content relevance assessment, stance detection, topic detection, and general frame detection. They found that ChatGPT not only outperforms crowd-workers in terms of accuracy, but also shows a high degree of consistency in its annotations.

The study by Gao et al. [10] explored automatic, human-like evaluation of text summarization with ChatGPT in comparison to human experts. The model was prompted to evaluate the quality of generated summaries based on their relevance, coherence, fluency, and consistency. The authors found that ChatGPT’s evaluations were highly correlated with those of human experts.

Zhuo et al. [35] proposed to use LLMs as evaluators of code generation. The authors used the CoNaLa dataset [34] and reported high example-level and corpus-level Kendall-Tau, Pearson, and Spearman correlations with human-rated code usefulness for various programming languages.

In the domain of information retrieval, Faggioli et al. [9] investigated the performance of GPT-3.5 and YouChat for query-passage relevance judgements. Given the high subjectivity of the task, their results showed a reasonable correlation between highly trained human assessors and fully automated judgements.

Closest to our work is that by Chiang et al. [5], who compared the judgments of GPT-3 on text quality to expert human judgments on a 5-point Likert scale for four quality attributes: grammaticality, cohesiveness, likability, and relevance. Their findings revealed varying degrees of positive correlations between GPT-3 and human judgments, ranging from weak to strong.

When compared to existing research, our work pioneers the study of argument quality annotations generated by LLMs. To provide a thorough evaluation, we use an inter-annotator agreement metric to assess the consistency of annotations from these models, human experts, and novices. This comparison allows us to understand the alignment between LLMs and human annotators, and to determine the potential of using LLMs as argument quality annotators.

3 Experimental Design

To investigate the reliability of large language models (LLMs) as annotators of argument quality, we conduct an experiment comparing human annotations with ratings generated automatically by LLMs. We treat LLMs as separate annotators and analyze the agreement both within and across groups of humans and models.

Table 1. Descriptions of argument quality dimensions as per Wachsmuth et al. [29].

3.1 Expert Annotation

The goals of argumentation are manifold and include persuading audiences, resolving disputes, achieving agreement, completing inquiries, or recommending actions [26]. Due to the variety of these goals, the dimensions of argument quality are equally diverse. Based on a comprehensive survey of the argumentation literature, Wachsmuth et al. [30] proposed a fine-granular taxonomy of argument quality dimensions that differentiates logical, rhetorical, and dialectical aspects. An overview of all quality dimensions is provided in Table 1.

In their work, Wachsmuth et al. [30] employed experts to rate the quality of arguments according to their proposed taxonomy. Three experts were selected out of a pool of seven based on their agreement in a pilot annotation study. These three experts comprised two PhDs and one PhD student (two female, one male) from three different countries. To construct the Dagstuhl-15512-ArgQuality corpus, the selected experts annotated 320 arguments from the UKPConvArgRank dataset [13]. The resulting corpus contains 15 quality dimensions for each argument, each rated on a 3-point Likert scale (low, medium, high) or as not assessable. Each argument in the corpus belongs to one of 16 different topics and takes a stance for or against the topic. The dataset is balanced and contains 10 supporting and 10 attacking arguments per topic. The annotation guidelines, which define all quality dimensions in more detail, are publicly available online.
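To make the structure of the corpus concrete, the following sketch shows one way such an annotated argument could be represented in code. The class and field names are purely illustrative assumptions and do not mirror the released corpus files.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Illustrative record layout for one annotated argument; the names are
# hypothetical and not taken from the Dagstuhl-15512-ArgQuality release.
@dataclass
class AnnotatedArgument:
    topic: str                 # one of the 16 topics
    stance: str                # "for" or "against" the topic
    text: str                  # the argument itself
    # One rating per quality dimension and annotator:
    # 1 (low), 2 (medium), 3 (high), or None for "not assessable".
    ratings: Dict[str, Dict[str, Optional[int]]] = field(default_factory=dict)

example = AnnotatedArgument(
    topic="Is TV better than books?",
    stance="against",
    text="Books let readers imagine the story themselves ...",
    ratings={"local acceptability": {"expert_1": 3, "expert_2": 2, "expert_3": 3}},
)
```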

3.2 Novice Annotation

To provide an additional point of reference for determining the abilities of LLMs as argument quality annotators, we conducted an annotation study involving humans with no prior experience with computational argumentation. We asked undergraduate students to assess the quality of arguments from the Dagstuhl-15512-ArgQuality corpus using the same taxonomy.

The expert annotation guidelines require that annotators have expertise in computational argumentation. To make this task accessible for novices, we paraphrased the annotation guidelines and the definitions of argument quality dimensions to ensure clarity and comprehension. These simplified definitions for each quality dimension can be found in the Appendix. To illustrate, the expert definition of local acceptability of an argument is stated as follows:

Definition 1

(Local Acceptability (Expert)). A premise of an argument should be seen as acceptable if it is worthy of being believed, i.e., if you rationally think it is true or if you see no reason for not believing that it may be true.

The above definition requires an annotator to distinguish between premises and arguments. To ease the understanding and reduce the necessary prior knowledge, we simplify the definition of local acceptability as follows:

Definition 2

(Local Acceptability (Novice)). The reasons are individually believable: they could be true.

We refer to arguments as “reasons” within the simplified guidelines and combine the stance with the issue into a “conclusion”. For example, given the issue “Is TV better than books?” and the stance “No it isn’t”, we state the conclusion as “TV is not better than books”.

Each novice annotator was presented with an argument, a conclusion, and the simplified definitions of the quality dimensions. Identical to the annotation procedure for expert annotations, the annotators were tasked to rate each quality dimension of the argument on a 3-point Likert scale or as not assessable.

In total, we recruited 108 students to annotate the quality of the 320 arguments from the dataset. We assigned 10 arguments to each student in order to obtain at least three annotations per argument and quality dimension. Since not all students finished their annotations and some students annotated a wrong set of arguments, we obtained a minimum of three annotations per argument and quality dimension for only 248 arguments. We treat the missing annotations of the remaining 72 arguments as non-evaluable. For the 163 arguments for which we collected more than three annotations, we select the three annotations that maximize the inter-annotator agreement measured by Krippendorff’s \(\alpha \).
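This selection step can be sketched as a small search over annotator triplets. The sketch below assumes the Python krippendorff package and treats the 3-point scale as ordinal data; the actual implementation details are not specified in this paper.

```python
from itertools import combinations

import numpy as np
import krippendorff  # pip install krippendorff

def select_best_triplet(rating_rows):
    """Pick the three annotators whose ratings maximize Krippendorff's alpha.

    rating_rows: one rating vector per annotator for the same argument
    (one value per quality dimension, np.nan for "not assessable").
    Using the ordinal level of measurement is our assumption for the
    3-point scale.
    """
    best_alpha, best_combo = float("-inf"), None
    for combo in combinations(range(len(rating_rows)), 3):
        data = np.array([rating_rows[i] for i in combo], dtype=float)
        alpha = krippendorff.alpha(reliability_data=data,
                                   level_of_measurement="ordinal")
        if alpha > best_alpha:
            best_alpha, best_combo = alpha, combo
    return best_combo, best_alpha
```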

Fig. 1. An expert prompt that contains instructions and an example issue, stance, and argument from the Dagstuhl-15512-ArgQuality corpus. This particular prompt example asks the model to rate the clarity of the argument. The reasoning variant of this prompt is colored in gray.

3.3 Models

Due to the complexity of the task, we focus on state-of-the-art LLMs for the automatic evaluation of argumentation quality. Building upon previous research regarding LLMs as annotators (cf. Sect. 2.2), one of the most commonly used models is GPT-3 [2]. Specifically, we use the gpt-3.5-turbo-0613 model, accessible via OpenAI’s API. Despite the availability of the newer GPT-4 model [21], we do not employ it in our study due to the significantly higher associated costs.

In addition, we use Google’s recently released PaLM 2 model [1], the successor to the original PaLM model [6]. The authors report comparable results to GPT-4 in semantic reasoning tasks, which makes it interesting for the evaluation of argument quality. For PaLM 2, we use the text-bison@001 version of the model.

Both PaLM 2 and GPT-3 are closed-source language models. We initially intended to incorporate Meta’s Llama 2 model [28] in our experiments in order to evaluate the performance of open-source LLMs on our task. However, in pilot experiments, Llama 2 with 7 billion parameters did not follow the instructions and therefore did not produce quality scores. Although the 13-billion-parameter version of Llama 2 did generate quality scores, they were seemingly random, with agreement across multiple runs close to zero. Due to hardware limitations, we did not test the largest Llama 2 model with 70 billion parameters.

PaLM 2 and GPT-3 allow specifying a set of parameters, such as the temperature, to control the diversity of the output, where lowering the temperature reduces the ‘randomness’ of the output. For our experiments, we choose a reasonably low temperature of 0.3. Other parameters that we keep constant across models include the nucleus sampling parameter \(p=1.0\) [14], the number of most probable tokens \(k=40\), and a maximum of 256 newly generated tokens.
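As an illustration, quality ratings could be requested with these hyperparameters roughly as follows. The snippet assumes the legacy (pre-1.0) openai Python package and Google's Vertex AI SDK with placeholder credentials; the exact client code used in our experiments is not reproduced here. Note that OpenAI's API does not expose a top-k parameter, so \(k=40\) applies only to PaLM 2.

```python
import openai                                        # legacy (<1.0) interface
import vertexai
from vertexai.language_models import TextGenerationModel

openai.api_key = "YOUR_OPENAI_API_KEY"               # placeholder credentials
vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders

prompt = "Rate the clarity of the following argument ..."  # cf. Fig. 1

# GPT-3.5 via OpenAI's API (no top-k parameter is exposed here)
gpt_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
    top_p=1.0,          # nucleus sampling parameter p
    max_tokens=256,     # maximum number of newly generated tokens
)
gpt_rating = gpt_response["choices"][0]["message"]["content"]

# PaLM 2 via the Vertex AI SDK
palm = TextGenerationModel.from_pretrained("text-bison@001")
palm_rating = palm.predict(
    prompt,
    temperature=0.3,
    top_p=1.0,
    top_k=40,                   # number of most probable tokens k
    max_output_tokens=256,
).text
```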

3.4 Prompting

Two different groups of human annotators, the expert annotators of Wachsmuth et al. [30] and the novice annotators recruited for this work, had access to different knowledge sources in their annotation guidelines. To determine whether the impact of this difference is similar between humans and LLMs, we created prompts that reflect the knowledge from the annotation guidelines of experts and novices. We refer to these prompt types as expert and novice prompts.

Besides instructions, an expert prompt consists of an issue, a stance, and an argument from the Dagstuhl-15512-ArgQuality corpus. The expert prompt also contains the name and original definition of the quality dimension from Wachsmuth et al. [30] as well as the annotation scheme (3-point Likert scale or “not assessable”). An example of an expert prompt is shown in Fig. 1.

In contrast to the expert prompt type, novice prompts contain a conclusion (as described in Sect. 3.2) instead of an issue and stance. In the novice prompt, the definition of the quality dimension to be assessed is replaced by the simplified definition. However, the textual argument, which is renamed to “reasons”, and the annotation scheme remain identical to the expert prompt.

Recently, it has been shown that explanation-augmented prompts can elicit reasoning capabilities in LLMs and improve their performance across various tasks [17, 33]. In pilot experiments, we found that GPT-3 produces more consistent annotations if we prompt the model to provide an explanation for the chosen score. We therefore test reasoning prompt variants in which we ask the model to provide an explanation for the generated quality rating.
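A rough sketch of how the four prompt variants (expert/novice, with and without the reasoning instruction) can be assembled is shown below. The exact wording of our prompts is given in Fig. 1 and the Appendix, so all strings in this sketch are merely illustrative.

```python
def build_prompt(dimension, definition, argument,
                 issue=None, stance=None, conclusion=None,
                 expert=True, reasoning=False):
    """Assemble an expert or novice prompt for one quality dimension.

    Illustrative only: the wording does not reproduce the prompts in Fig. 1.
    """
    if expert:
        context = f"Issue: {issue}\nStance: {stance}\nArgument: {argument}"
    else:
        # Novice prompts combine issue and stance into a conclusion and
        # rename the argument to "reasons" (cf. Sect. 3.2).
        context = f"Conclusion: {conclusion}\nReasons: {argument}"
    instruction = (
        f"Rate the {dimension} of the argument as 1 (low), 2 (medium), "
        f"3 (high), or 'not assessable'.\nDefinition: {definition}"
    )
    if reasoning:
        instruction += "\nExplain the reasoning behind your rating."
    return f"{instruction}\n\n{context}"
```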

To take the randomness of the output of LLMs for the same prompt into account, each prompt variant is repeated (at least) three times for each argument and quality dimension. Each prompt repetition is considered as a separate annotator in order to calculate the agreement between the annotations and to draw conclusions about the consistency of the quality annotations of LLMs.

4 Results

To understand the strengths and weaknesses of large language models (LLMs) as argument quality assessors and to answer our research questions, we use the prompting approaches described in Sect. 3.4 to generate LLM annotations for arguments from the Dagstuhl-15512-ArgQuality corpus. The dataset contains 320 statements, 16 of which were originally judged as non-argumentative by expert human annotators and therefore are excluded from the analysis.

First, to identify biases in human and LLM argument quality annotations, we analyze the distribution of assigned labels across all quality dimensions. This distribution is visualized in Fig. 2. Human annotations show an almost balanced distribution between low, medium, and high quality ratings. However, it is noteworthy that human novices tend to assign high ratings more frequently than experts. As for the models, GPT-3 with expert prompts displays a much more skewed distribution with a strong bias towards medium ratings, deviating from the trend observed in human assessors. In contrast, when GPT-3 is prompted with novice-level guidelines, it tends to assign high ratings more frequently. Notably, annotations generated by PaLM 2 have a distribution similar to that of human annotators, which seems promising for the subsequent analysis of agreement with human assessments.

Overall, it can be stated that not only the choice of model but also the prompt type has a major influence on the generated argument quality ratings. Even slight prompt modifications, such as asking the model to justify its score, can result in a notable change in the assigned quality scores, which in our case is especially prominent for PaLM 2 with expert prompts. Another interesting observation is that GPT-3 almost always provides a rating for a given dimension: in only 214 out of 21,120 cases (\(\approx \)1%) did this model not generate a score. The instances where PaLM 2 did not assess argument quality sum up to 4,972 (\(\approx \)23%) and mostly stem from content policies, particularly in cases where arguments revolve around graphic topics such as pornography or contain offensive statements.

Fig. 2. Distribution of the assigned quality ratings across all quality dimensions, compared between human annotators and LLMs.

Table 2. Inter-annotator agreement per argument quality dimension within each group of human annotators and LLMs, reported as Krippendorff’s \(\alpha \). The dimension with the highest agreement within each group is marked in bold.

4.1 Consistency of Argument Quality Annotations

We address our first research question concerning the consistency of argument quality assessments by comparing the agreement levels within LLM groups with those of human assessors. To quantify the agreement within each group of annotators, we use Krippendorff’s \(\alpha \). To ensure a fair comparison with human annotators, we evaluate the agreement between three LLM annotation runs.
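Treating the three repetitions of one prompt variant as three annotators, this agreement can be computed with the Python krippendorff package as sketched below; choosing the ordinal level of measurement for the 3-point scale is our assumption, and the rating values shown are illustrative.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# One row per annotation run, one column per (argument, dimension) item;
# np.nan marks missing or "not assessable" ratings.  Values are illustrative.
runs = np.array([
    [3, 2, 2, np.nan, 1],
    [3, 2, 1, 2,      1],
    [3, 3, 2, 2,      1],
])
alpha = krippendorff.alpha(reliability_data=runs,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha across three runs: {alpha:.2f}")
```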

Table 2 shows Krippendorff’s \(\alpha \) for human experts, human novices, and all LLM prompt variants across individual quality dimensions as well as the overall agreement. Human annotators exhibit generally low agreement, with a maximum of 0.43 on the local acceptability dimension for novices and 0.45 on reasonableness for experts. This low level of agreement between humans emphasizes the subjectivity and complexity of assessing argument quality in a fine-grained taxonomy. For most quality dimensions, novice annotators show slightly higher agreement than expert annotators, which could be due to the clearer definitions of the quality dimensions or to the optimization of agreement for arguments that received more than three annotations (cf. Sect. 3.2).

In contrast, LLM agreement between annotation repetitions is substantially higher. Interestingly, the PaLM 2 model shows near-perfect agreement for both expert and novice prompts, but its agreement drops notably when it is asked to explain its reasoning. In contrast to PaLM 2, the GPT-3 model exhibits a slight improvement in agreement when asked to provide an explanation. Such disparities might be due to differences in the underlying architectures and training methodologies of the two models, which require further exploration beyond the work at hand. Overall, both models show a high degree of agreement across different runs, with the impact of reasoning prompts on the agreement varying depending on the employed model.

RQ1: Do LLMs provide more consistent evaluations of argument quality compared to human annotators? The observed low agreement among human annotators underscores that evaluating argument quality is indeed a subjective and challenging task. In contrast, the significantly higher agreement among different LLM runs highlights the potential of these models for providing more consistent argument quality annotations.

Table 3. Number of arguments with perfect agreement for each argument dimension within each group of human annotators (expert, novice).
Fig. 3. Inter-annotator agreement (Krippendorff’s \(\alpha \)) between human and LLM annotations for each fine-grained argument quality dimension.

Fig. 4. Inter-annotator agreement (Krippendorff’s \(\alpha \)) between human and LLM annotations for each coarse-grained argument quality dimension.

Fig. 5. Overall inter-annotator agreement (Krippendorff’s \(\alpha \)) between each combination of human expert, novice, and LLM-generated annotations.

4.2 Agreement Between Humans and LLMs

We discovered that LLMs generate annotations more consistently than humans. However, to assert that LLMs can reliably evaluate the quality of arguments, we need to test how the automatic annotations align with the human annotations. Given the low agreement among human annotators, we created subsets of arguments for each quality dimension, where either all expert annotators or all novice annotators unanimously agreed on a score. Table 3 presents the statistics of the resulting subsets with perfect agreement, which we employ for further inter-annotator agreement analysis.
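Constructing these subsets amounts to keeping, per quality dimension, only the arguments on which all annotators of a group assigned the same score, as in the short sketch below; the matrix layout (one row per annotator) is an illustrative assumption.

```python
import numpy as np

def perfect_agreement_items(ratings):
    """Indices of items on which all annotators gave the identical rating.

    ratings: array of shape (num_annotators, num_items) for one quality
    dimension and one annotator group.
    """
    ratings = np.asarray(ratings, dtype=float)
    unanimous = np.all(ratings == ratings[0], axis=0)
    return np.flatnonzero(unanimous)
```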

Figure 3 shows the agreement for each quality dimension, as measured by Krippendorff’s \(\alpha \), between human annotations and automatically generated quality ratings by LLMs with different prompts. Overall, we observe moderate agreement across most quality dimensions, with the annotations by PaLM 2 reaching a maximum of 0.71 for appropriateness and global relevance. Regardless of the prompt type, PaLM 2 annotations generally achieve higher agreement with human annotations compared to GPT-3. In the case of local and global sufficiency, there are even systematic disagreements between the GPT-3 assessments and those of human experts. Similarly, disagreement is observed between PaLM 2 annotations and human novices for the global sufficiency dimension.

Overall, there is a large variance in agreement between model and human judgments across different quality dimensions. For example, while the agreement on credibility and appropriateness lies in the ranges [0.08, 0.62] and [0.06, 0.71], respectively, the agreement on local and global sufficiency fluctuates even more.

In terms of prompt variants, we can see that GPT-3 with expert prompts shows a higher agreement with human expert annotations than with human novice annotations, and a similar trend is observed for GPT-3 with novice prompts and human novices. On the other hand, PaLM 2 with either of the prompt types tends to show higher agreement with human experts. Similar findings can be inferred from the agreement between LLMs and human novices and experts on the coarse-grained quality dimensions that are visualized in Fig. 4.

Table 4. Change in overall Krippendorff’s \(\alpha \) after adding LLM annotations to human expert or novice annotations. Significant changes (\(p < 0.05\)) between the agreement of the original annotation set and the modified annotation set are marked with *.

RQ2: Do the assessments of argument quality made by LLMs align with those made by either human experts or human novices? We found that LLMs agree most with human argument quality ratings on fine-grained quality dimensions such as credibility, emotional appeal, appropriateness, and global relevance or on coarse-grained dimensions such as reasonableness and overall quality. Overall, we found varying degrees of agreement between LLMs and human annotators, with PaLM 2 annotations tending to generally align more with those of humans.

4.3 LLMs as Additional Annotators

LLMs can be employed either as independent automatic argument quality raters or as a source of additional annotations to validate a set of human annotations. For the second scenario, we analyze the overall agreement between different combinations of human (expert or novice) and LLM annotators.

Figure 5 illustrates the overall Krippendorff’s \(\alpha \) agreement for each combination of annotator groups. We can see that there is low to medium agreement for each combination of annotators, with the lowest value being 0.25 between human novices and human experts and the highest value being 0.77 between PaLM 2 with novice and expert prompts. Regardless of the prompt type, the agreement between PaLM 2 and GPT-3 is moderate, ranging from 0.38 to 0.67. This suggests the potential efficacy of employing diverse models as supplementary annotators.

We further investigate whether the agreement changes if we incrementally integrate automatically generated annotations into the original set of human annotations. The results reported in Table 4 show that adding PaLM 2 annotations can significantly improve the agreement of human experts as well as human novices. A significant increase is already visible after adding three annotations to the annotations of human experts and four to the annotations of human novices. However, the introduction of GPT-3 annotations leads to a significant decrease in agreement. This can be attributed to the relatively low level of agreement between GPT-3 and human annotators (cf. Fig. 5).
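This incremental analysis can be sketched as appending LLM annotation runs one at a time to the human reliability matrix and recomputing Krippendorff's \(\alpha \), as shown below; the significance test used for Table 4 is not reproduced here, and the ordinal level of measurement is again our assumption.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def alpha_after_adding_runs(human_rows, llm_rows):
    """Krippendorff's alpha after adding 1, 2, ... LLM runs to human ratings.

    human_rows / llm_rows: lists of rating vectors (one per annotator or run,
    one value per item, np.nan for missing).
    """
    rows = list(human_rows)
    alphas = []
    for llm_row in llm_rows:
        rows.append(llm_row)
        alphas.append(krippendorff.alpha(
            reliability_data=np.array(rows, dtype=float),
            level_of_measurement="ordinal"))
    return alphas
```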

RQ3: Can integrating LLM annotations with human annotations significantly improve the resulting agreement in argument quality ratings? The analysis indicates that the impact on agreement levels when incorporating generated quality assessments with human annotations varies based on the employed LLM. When using a powerful model such as PaLM 2, the agreement of human annotations can be significantly increased by adding three or more generated annotations. These results underscore LLMs as valuable contributors to the annotator ensemble.

5 Conclusion

In this paper, we investigated the effectiveness of LLMs, specifically GPT-3 and PaLM 2, in evaluating argument quality. We utilized four distinct prompt types to solicit quality ratings from these models and compared their assessments with those made by human novices and experts. The results reveal that LLMs exhibit greater consistency in evaluating argument quality compared to both novice and expert human annotators, showcasing their potential reliability. Based on our empirical analysis, we can recommend two modes of application for LLMs as annotators of argument quality: (1) a fully automatic annotation procedure with LLMs as automatic quality raters, for which we found moderately high agreement between PaLM 2 and human expert quality ratings, or (2) a semi-automatic procedure using LLMs as additional quality annotators, resulting in a significant enhancement in agreement when combined with human annotations. In both modes, LLMs can serve as a valuable tool for streamlining the argument quality annotation process on a large scale.

To further minimize annotation expenses, we intend to expand these experiments to various open-source large language models. In addition to the zero-shot prompting technique investigated here, enhancing agreement with human annotations could involve few-shot prompting or fine-tuning LLMs based on human judgments of argument quality. We see great potential in LLMs as argument quality raters, which, if further optimized to agree more closely with human assessments, can reduce manual effort and expenses, establishing them as valuable tools in argument mining.

6 Limitations

The experiments in this paper are based on the hypothesis that multiple runs of the same model, prompt, and hyperparameters simulate different annotators as a result of nucleus sampling. This hypothesis has not yet been proven, and its validity cannot be inferred from the analysis. The higher agreement between runs of the same model indicates a lower variance in the automatically generated annotations, which might argue against this hypothesis. Therefore, the agreement between model and human annotations has been calculated using only examples with perfect agreement, in order to exclude effects of this variance. Further experiments are needed to determine how to replicate the behavior of different annotators. Although we thoroughly investigate LLMs as quality assessors for arguments, the generalizability of our results beyond argumentation is not yet clear. However, due to the complexity and subjectivity of argument quality assessment, as seen from the low human inter-annotator agreement, we argue that this task might be a worst-case scenario for LLMs, and we would expect comparable or even better results in less subjective task domains. More experiments are needed to confirm or reject this hypothesis.