Introduction

Extracting meaning from text is a central part of communication research, and in recent years this has increasingly been done with the aid of computational methods [1]. This paper introduces a (German-language) dataset for Propositional Claim Detection (PCD), a Natural Language Processing (NLP) task that aims at classifying sentences that can be true or false. Models trained on this dataset can be used for various purposes in communication research and beyond. One natural domain of application is misinformation. PCD is in line with previous approaches to claim detection that aim to support fact-checkers by identifying claims that potentially carry misinformation [2].

Sentences that can be true or false are known in philosophy and mathematical logic as sentences with propositional content. As propositional sentences are almost exclusively declarative sentences, PCD mostly leverages grammatical information (e.g., word order, punctuation, or tense) for classification. Grammatical or syntactic information has often been neglected in computational methods for social scientific research [3]. However, it has properties that are beneficial to methods like PCD. A core assumption of this paper is that the grammatical information leveraged by PCD remains relatively stable across domains, so that PCD models can be used in various contexts and for various purposes. The guiding questions of this paper are whether PCD can reach results competitive with other approaches to claim detection and whether these results remain stable across texts from different topical domains, time periods, or writing styles. As will be shown, both questions can be answered positively.

This paper makes multiple contributions: (1) a new task design for the detection of claims is introduced, (2) a dataset for PCD is described and made publicly available, (3) an array of models is tested on the dataset with a special focus on domain adaptation.Footnote 1

Background

Automated fact-checking

Guo et al. [4] provide a survey of the different sub-tasks of automated fact-checking and identify research on claim detection, retrieving previously checked claims, claim verification, and generating a textual justification for a verdict. Claim verification in particular has received much attention (see, e.g., [5, 6] for a survey on the related task of stance detection). For this task, a model is supposed to derive an evidence-based verdict (true, false, etc.) for a given claim. However, this task has also drawn criticism. Glockner et al. [7] argue that many approaches are developed and tested in an unrealistic setup and cannot refute real-world misinformation. Similarly, Nakov et al. [8] find that human fact-checkers have little trust in a fully automated pipeline and are rather interested in tools like claim detection that only aim at partial automation.

Models for claim detection are trained to identify claims that potentially carry misinformation. The aim is to decrease the workload of human fact-checkers by providing a pre-selection of relevant claims that can then be verified. A claim is usually understood as one isolated sentence, and claim detection is most often designed as a binary sentence classification task. There are exceptions: Arslan et al. [2] use a taxonomy with three classes, Konstantinovskiy et al. [9] have seven classes, and Gencheva et al. [10] model it as a rating task. However, even these approaches are often reduced to a binary format [11].

The most notable approach to claim detection is the Claimbuster projectFootnote 2 that contributed a widely used dataset [2] and built models that achieved strong results on it [11]. In their formulation, and in most others, claim detection is the task of classifying checkworthy claims (Table 1). Checkworthiness is defined differently by different authors. Arslan et al. [2] understand checkworthy claims as “factual claims that the general public will be interested in learning about their veracity”. Alam et al. [20] asked the annotators to label a sentence as checkworthy if they could affirm the following question: “Do you think that a professional fact-checker should verify the claim in the tweet?” A third approach was taken by, for example, Shaar et al. [14], who matched sentences from US presidential debates with fact-checking articles by PolitiFact and labeled them as checkworthy if there was a corresponding article.

Table 1 Overview of claim detection approaches (F: factuality; CW: checkworthiness)

This and similar definitions have also attracted criticism. Allein and Moens [21] note that checkworthiness depends on the prior knowledge of a person, which makes the concept inherently subjective. They conclude that checkworthiness is an inapt concept to guide claim detection. One can add that the definition’s reliance on “the general public” is itself an idealization, as the public is often fragmented and different social groups differ in their values and political ideologies. Both aspects are relevant for determining whether a claim should be fact-checked.

Arguing that checkworthiness is an editorial decision that is best left to human fact-checkers, Konstantinovskiy et al. [9] propose an alternative formulation of the claim detection task. They focus on factual claims, understood as sentences about statistics, legal affairs, or causal relationships, as opposed to sentences about personal experiences like “I woke up this morning at 7 a.m.”. Risch et al. [18] and Wilms [22] created a German-language dataset for the same task. However, in their taxonomy, a positive example is a claim to truth or a sentence that provides external evidence (links, quotes, etc.) rather than internal evidence (personal experience). Gupta et al. [19] define a claim as stating or asserting that something is the case, with or without providing evidence or proof. They use a similar scheme as the previous authors, but they also consider personal experiences and humor/sarcasm as positive instances. All three approaches reach strong results (see Table 1).

Computational linguistics

A line of research in computational linguistics that shares many similarities with PCD is Dialogue Act classification (DA). A well-known corpus is the Switchboard Dialogue Act corpus (SwDA), originally introduced by Jurafsky et al. [23], which provides a fine-grained labeling scheme of 43 classes, including, for example, opinion statements, non-opinion statements, and different types of questions. The task is approached as text classification but also as sequence labeling, and there are approaches [24] that reach scores of 82.9 in \(\hbox {F}_{1}\) (and 91.1 on the MRDA corpus [25]). DA and PCD follow the same theoretical tradition that is based on the works of Austin and Searle.

Argumentation Mining (AM) is closely related to claim detection as it aims at identifying the components of an argument and their relations. Similar to claim detection, extracting claims lies at the heart of AM. However, the definitions are not the same. In AM, a claim is understood as a conclusion rather than a premise [e.g., 26], while for claim detection both can be claims. Moreover, sentences or text spans that are understood as claims in AM would often rather qualify as opinions from the perspective of claim detection.

The structure of propositional claim detection

Propositional content

PCD is about detecting sentences with propositional content. This term is used in philosophy and mathematical logic and denotes sentences that have a truth value, i.e., that can be true or false [27]. The concept of truth value is different from the concept of truth. Sentences that have a truth value are possibly but not necessarily true. They can also be false. Take the sentences “All cows are ruminants.” and “Are all cows ruminants?” Even without knowing the word “ruminant”, one can acknowledge that the first sentence can be true or false but the second cannot. Accordingly, we do not need to know if a sentence is true in order to know that it has a truth value. Note that sentences with truth values are a broader class than factual sentences and the two sentence types should not be confused (see “Discussion” section).

Sentences that carry a truth value meet two minimal conditions: (1) the sentence must have a condition of satisfaction and (2) this condition must have a word-to-world direction of fit. For a sentence to have a condition of satisfaction is to have a relation to the world that can be fulfilled.Footnote 3 For example, saying that Japan has a total population of (roughly) 125 million relates to the world and it is satisfied because Japan does have about 125 million citizens (a). Saying that Japan has the dirtiest streets worldwide also relates to the world but it is not satisfied because Japan’s streets are extremely clean (b). And if the Japanese digital minister pledges that the government will offer more digital services, his sentence relates to the world and it is satisfied if the government actually manages to provide more digital services (c). However, other sentence types like apologies or congratulations cannot be satisfied. If the digital minister of Japan apologizes for not providing more digital services, his utterance cannot be satisfied. The purpose of his statement is neither referring to a desired future state of affairs nor is it a statement that can be true or false. It is meant as an apology, which is something different.

The second condition is a word-to-world direction of fit. This term dates back to the works of John Searle and is opposed to the world-to-word direction of fit (see [28, p. 100ff.]). Examples (a) and (b) share a word-to-world direction of fit. This means that sentences (a) and (b) must correctly represent the world in order to be satisfied. Sentence (a) does that while sentence (b) is a misrepresentation of the world. In such cases, we speak of propositional sentences, i.e., of sentences that carry a truth value. Opposed to this are sentences like (c) that have a world-to-word direction of fit. What the digital minister pledges is not intended to correctly represent the world. Instead, the world is supposed to adapt to the words of the Japanese digital minister so that we can say that his promise is fulfilled.

PCD taxonomy

As the purpose of PCD is to detect sentences that carry a truth value, the task is to identify sentences that (1) have a condition of satisfaction and (2) a word-to-world direction of fit. For reasons that will be explained shortly, PCD also adds a third condition: (3) the sentence must be in the present or past tense. Sentences that meet conditions (1)–(3) are called assertions. They are sentences in the past or present tense that carry truth values. Sentences that only meet conditions (1) and (2) but are in the future tense are called predictions. Sentences that meet condition (1) but not (2) are called opinions. Opinions have a world-to-word direction of fit and can be put in any tense. Sentences that neither meet condition (1) nor (2) are called other (Table 2).

Table 2 Taxonomy for PCD

PCD’s structure can largely, but not entirely, be mapped onto (German) grammatical structure. The most obvious correlation is tense, as it is a grammatical concept. However, there is more. German distinguishes five grammatical sentence types: declarative (statement), interrogative (question), imperative (command), exclamation (affect), and optative (wish) sentences (see, for example, [29]).Footnote 4 Sentences that have a condition of satisfaction are almost exclusively declarative sentences. Declarative sentences are sentences in which the finite verb occupies the second position and which end with a period.
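To illustrate how this grammatical criterion could be operationalized, the following is a minimal sketch of a rule-based filter for German declarative sentences, assuming the spaCy model de_core_news_sm is installed; it only crudely approximates the verb-second criterion and is not part of the released pipeline.

```python
import spacy

# Assumption: the small German spaCy model is installed
# (python -m spacy download de_core_news_sm).
nlp = spacy.load("de_core_news_sm")

def is_declarative(sentence: str) -> bool:
    """Crude approximation of the declarative criterion: the sentence
    ends with a period and contains a finite verb that does not stand
    in first position (verb-first sentences are typically questions
    or imperatives in German)."""
    if not sentence.strip().endswith("."):
        return False
    doc = nlp(sentence)
    # STTS tags for finite verbs end in "FIN" (VVFIN, VAFIN, VMFIN).
    finite_positions = [tok.i for tok in doc if tok.tag_.endswith("FIN")]
    return bool(finite_positions) and finite_positions[0] > 0

print(is_declarative("Alle Kühe sind Wiederkäuer."))  # True
print(is_declarative("Sind alle Kühe Wiederkäuer?"))  # False
```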

The reduction of propositional sentences to declaratives, however, is an imperfect one. One exception are rhetorical questions, which can carry propositional content but do not qualify as declarative sentences. For example, if a Fox News moderator rhetorically asks whether the 2020 US presidential election was stolen, this can often be understood as a statement that the election results are not legitimate. This shortcoming is accepted for the moment and must be addressed in future work.

While the difference between assertions and opinions on the one hand and predictions and other on the other hand can be reduced entirely to grammatical criteria, namely tense and sentence type, this is not the case for the difference between assertions and opinions. There is no purely grammatical criterion that exhaustively separates these two classes. In many cases, it is necessary to look at the actual meaning of a given sentence in order to assign it to one of the two categories. I will briefly outline some heuristics for differentiating between the two classes. More details can be found in the codebook.

1. Within an argument, assertions often take the supporting role, while opinions require support, i.e., assertions are often used as justifications, while opinions need justification.

2. Sentences that express opinions often contain modal verbs like “should” or “must”.

3. Assertions often contain quotes. Note that in this case, we evaluate if something is correctly quoted and not if the content of the quote is correct.

4. Assertions often start with “We see that”, “I know that”, or “This proves that”, while opinions often start with “I want that” or “It is our opinion that”.

Finally, a last remark about why tense is included in PCD. Tense only matters for the distinction between predictions and the other classes. The reason why predictions form a category of their own is that PCD is designed to assist fact-checkers. Fact-checkers are usually not interested in predictions, as statements about future events often cannot be verified.Footnote 5 Statements about the future do not necessarily use the future tense but can also be in the present tense and use temporal indicators like “tomorrow”. However, these indicators are dependent on the time of utterance. Given that PCD is an NLP task, this matters strongly: often the time of utterance cannot be inferred from the sentence alone but requires contextual information or metadata. In order to avoid this complication, PCD works with the limited understanding of predictions as declarative sentences in the future tense. Another advantage is that this reduces the difference between predictions and the other classes to a difference in grammar (tense), which is a key motivation for PCD.

Data selection and annotation

The dataset for PCD consists of a diverse set of (German) sources: newspaper articles (Die Zeit, Tagesschau, Tagesspiegel, taz, Süddeutsche Zeitung), political TV shows (Anne Will and Hart aber Fair), party manifestos,Footnote 6 and tweets.Footnote 7 This broad array of text sources is meant to cover spoken and written text on the one hand and formal and casual writing styles on the other. The time period covered by the dataset ranges from 1994 to 2022, with a special focus on the years in which national elections took place.

There were four coders in total: three student assistants, who were paid according to the collective agreement for student employees, and the author. Coder training was followed by an evaluation in which all coders had to annotate the same 225 sentences. They achieved a Krippendorff’s alpha of 0.72 and then started the actual annotation process. The entire coding process, including the training, lasted about three months, with weekly workloads of about 400–700 sentences. To increase the output, about one third of the sentences were annotated by only one coder each. In order to speed up the annotation process, there was a rule-based (Regex) pre-annotation for predictions and other. These classes could be filtered with high (but not full) accuracy and were double-checked by a human coder. The remaining sentences were coded by at least two coders each. The entire process resulted in a total of 8425 annotated sentences. For more information on the annotation process, see “Appendix A”.
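The following sketch illustrates what such a rule-based pre-annotation could look like; the patterns are simplified examples for illustration, not the exact expressions used during annotation.

```python
import re

def preannotate(sentence: str):
    """Simplified illustration of the Regex pre-annotation:
    questions and exclamations are routed to "other",
    sentences with a conjugated form of "werden" to "prediction".
    Everything else is left to the human coders."""
    if re.search(r"[?!]\s*$", sentence):
        return "other"
    # Rough future-tense cue; "werden" is also used for the passive voice,
    # which is why the pre-annotation was double-checked by a human coder.
    if re.search(r"\b(werde|wirst|wird|werden|werdet)\b", sentence, re.IGNORECASE):
        return "prediction"
    return None  # no pre-annotation, goes to manual coding

print(preannotate("Wird die Regierung mehr digitale Dienste anbieten?"))  # other
print(preannotate("Die Regierung wird mehr digitale Dienste anbieten."))  # prediction
```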

For the optimal use of the limited resources, a pool-based Active Learning approach was chosen for selecting the data for annotation (Settles, 2010). The core assumption of Active Learning is that during the annotation process, the quantity of possible annotations is limited (due to financial and/or time constraints) but there is (almost) unlimited access to unlabeled data. The unlabeled data is called the pool. One way to choose the data for annotation is to sample randomly from the pool. In Active Learning, samples from the pool are instead chosen in multiple rounds and according to a “smart” method. There are various such methods, and they all aim to choose samples that are more informative for the model, so that it learns better and faster than with random sampling. As a result, less data is required for the model to achieve good results. The data was drawn from three pools of different text sources (Table 3). For more information, see “Appendix B”.

Table 3 Data sampling with active learning
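As a minimal sketch of the pool-based setup, one round of such a procedure could look as follows; uncertainty sampling and the logistic regression model are assumptions chosen for illustration, not necessarily the strategy used here (see “Appendix B”).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, batch_size=100):
    """Pick the pool examples the current model is least certain about."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)        # low max probability = high uncertainty
    return np.argsort(uncertainty)[-batch_size:]  # indices of the most uncertain items

def active_learning_round(X_labeled, y_labeled, X_pool):
    """One round: train on the labeled data, then query new items for annotation."""
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    query_idx = uncertainty_sampling(model, X_pool)
    # The queried sentences would now be sent to the human coders,
    # labeled, and moved from the pool into the labeled set.
    return query_idx
```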

Method

Consensus labels

It has been recognized that data quality is essential for machine learning and that fixing data issues often improves model performance more than model-centric approaches like hyperparameter tuning. One area of data-centric AI is label quality: there is research on how to find the best consensus label given multiple conflicting annotations and on reducing label noise, i.e., detecting mislabeled data points.

For the present study multiple techniques were evaluated in order to derive the final label for a given sentence: (1) strict majority vote, (2) soft majority vote, (3) Confident Learning [30], (4) CROWDLAB [31], (5) Confident Learning + CROWDLAB.

(1) and (2) are the most basic strategies. For the strict majority vote, only sentences that were seen by two or more annotators were included. In case of an agreement \(\le\) 0.5, the sentence was discarded. For the soft majority vote, no sentence was discarded and ties were resolved hierarchically: assertion > opinion > prediction > other. Strategies (3)–(5) are more sophisticated and based on more assumptions. They are briefly outlined below; for a more thorough discussion, see “Appendix C” and “Appendix D” and the original papers.
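A minimal sketch of strategies (1) and (2), with hypothetical annotation lists for illustration:

```python
from collections import Counter

PRIORITY = ["assertion", "opinion", "prediction", "other"]  # tie-break hierarchy

def strict_majority(labels):
    """Strategy (1): keep the label only if the sentence was seen by at
    least two annotators and more than half of them agree; otherwise
    discard the sentence (return None)."""
    if len(labels) < 2:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > 0.5 else None

def soft_majority(labels):
    """Strategy (2): always return a label; ties are resolved by the
    hierarchy assertion > opinion > prediction > other."""
    counts = Counter(labels)
    best = max(counts.values())
    tied = [lab for lab, c in counts.items() if c == best]
    return min(tied, key=PRIORITY.index)

print(strict_majority(["assertion", "assertion", "opinion"]))  # assertion
print(soft_majority(["opinion", "prediction"]))                # opinion (tie-break)
```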

Confident Learning is a method to detect mislabeled data points. Label errors can come in different forms, for example, when a data point is assigned to the wrong class or when it belongs to multiple classes but is only assigned to one. Confident Learning determines whether a mismatch between a model prediction and a gold label occurs because the model failed or because the gold label is incorrect. A machine learning model or ensemble is leveraged to predict, for each label, a probability that indicates whether the label is correct. For the present study, Confident Learning was used to remove all sentences whose gold labels were marked as incorrect.
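A sketch of what this filtering step could look like with the cleanlab library, which implements Confident Learning [30]; the placeholder data, the base model, and the cross-validated probability estimation are assumptions for illustration.

```python
from cleanlab.filter import find_label_issues
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for sentence embeddings and gold labels.
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8, random_state=0)

# Out-of-sample predicted probabilities via cross-validation,
# as recommended for Confident Learning.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Boolean mask of data points whose gold label is likely incorrect.
issue_mask = find_label_issues(labels=y, pred_probs=pred_probs)

# Remove the flagged sentences from the dataset.
X_clean, y_clean = X[~issue_mask], y[~issue_mask]
print(f"Removed {issue_mask.sum()} suspected label errors.")
```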

CROWDLAB is a method to find consensus labels for multiple and possibly conflicting annotations. It is based on two criteria. First, the annotator quality of each coder is computed, understood as the level of agreement of one coder with the others. A coder whose annotations are often in line with those of the others receives a high quality score. Second, a machine learning model or ensemble is trained on the data, and its predicted probabilities for each sentence are also taken into consideration when deriving the final label. Metaphorically speaking, the model simulates an additional annotator. By computing the annotator quality and simulating an additional annotator, it is possible to derive a label quality score even for sentences that were seen by only one annotator. For the present study, CROWDLAB was used to compute a quality score \(q\in [0,1]\) for each label, and labels with a score below a certain threshold (< 0.7) were removed from the dataset.
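The full algorithm is described in [31]; the following is only a simplified sketch of the underlying idea (agreement-based annotator weights combined with model probabilities), not a faithful reimplementation, and all numbers are hypothetical.

```python
import numpy as np

def label_quality(votes, model_probs, annotator_quality, threshold=0.7):
    """Simplified CROWDLAB-style score for one sentence.
    votes: dict annotator -> class index
    model_probs: the model's predicted probability for each class
    annotator_quality: dict annotator -> agreement-based weight in [0, 1]
    Returns the consensus class and whether it passes the threshold."""
    n_classes = len(model_probs)
    scores = np.asarray(model_probs, dtype=float)  # the model acts as one extra annotator
    for annotator, label in votes.items():
        one_hot = np.zeros(n_classes)
        one_hot[label] = 1.0
        scores += annotator_quality[annotator] * one_hot
    scores /= scores.sum()
    consensus = int(scores.argmax())
    return consensus, scores[consensus] >= threshold

votes = {"coder_1": 0, "coder_2": 1}          # conflicting annotations
quality = {"coder_1": 0.9, "coder_2": 0.6}    # hypothetical annotator weights
consensus, keep = label_quality(votes, [0.7, 0.2, 0.05, 0.05], quality)
print(consensus, keep)  # 0 False -> low-quality label, would be removed
```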

Fig. 1 Class and document distribution of the final dataset

The detailed results for all of the mentioned methods can be found in “Appendix E”. Methods (3)–(5) showed only little improvement over the strict majority vote, while the soft majority vote performed relatively poorly. For the final labels, the strict majority vote was chosen, as it is a relatively simple method that nevertheless performed well. The final dataset consists of 5373 sentences. The class and document distribution is displayed in Fig. 1. Other and assertion occur in roughly equal amounts, as do opinion and prediction. However, the first group makes up about 70% of the dataset, while the second group makes up the remaining 30%. The most frequent document type is parliament protocols, followed by sentences drawn from Twitter.

Model selection

Six different embedding methods were used: BoW, Tf-idf, Word2Vec [32], GloVe [33], Fasttext [34], and Sentence Transformer [35]. The first two were calculated using the scikit-learn implementation with the default parameters. Pre-trained embeddings for Word2Vec and GloVe were retrieved from DeepsetFootnote 8 and Fasttext embeddings were calculated using the flair library [36]. All three were pre-trained on a Wikipedia corpus and their vocabulary is pre-defined. Since all three are word-level embeddings, they had to be aggregated to sentence level: sentence embeddings were calculated by taking the mean of the word embeddings of a sentence. Words that did not occur in the pre-defined vocabulary were ignored.
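A minimal sketch of the mean-pooling step, here with gensim-style KeyedVectors standing in for the pre-trained Word2Vec/GloVe/Fasttext vectors; the file name and the dimensionality of 300 are assumptions, and the exact loading code differs per embedding.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumption for illustration: German word vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("german_vectors.txt")

def sentence_embedding(sentence: str, dim: int = 300) -> np.ndarray:
    """Average the word vectors of a sentence; out-of-vocabulary words
    are ignored, and an all-OOV sentence yields a zero vector."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)
```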

Logistic Regression, Support Vector Machine (SVM), and Naive Bayes (Multinomial for BoW and Tf-idf and Gaussian for the other embeddings) were chosen as classical machine learning architectures. Ensemble Methods were included, too: Random Forest, AdaBoost, and XGBoost. All these models were implemented using scikit-learn with the default hyperparameter settings.
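For reference, the classical models with default settings can be collected along the following lines (XGBoost via its own scikit-learn-compatible wrapper); this is a sketch, not the original training script.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

def build_models(sparse_counts: bool):
    """Default-parameter classifiers; Multinomial Naive Bayes is only
    used for the count-based representations (BoW, Tf-idf)."""
    return {
        "logreg": LogisticRegression(),
        "svm": SVC(),
        "nb": MultinomialNB() if sparse_counts else GaussianNB(),
        "rf": RandomForestClassifier(),
        "ada": AdaBoostClassifier(),
        "xgb": XGBClassifier(),
    }
```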

A set of transformer architectures was chosen to represent Deep Learning: the German base versions of Bert, Electra, RobertaFootnote 9 (called GottBert, see [37]) and DistilBertFootnote 10 as they can be found on Huggingface. It is expected that the transformer architecture, which is a driver of much of the recent progress in NLP, outperforms the classical architectures. However, transformer models require substantial computational resources and are more difficult to interpret than smaller models like SVM. For this reason, a broad selection of model architectures is evaluated.

The transformer models were trained for 5 epochs with a batch size of 32 and the default hyperparameters implemented by the transformers library (learning rate: 5e−5; weight decay: 0.0). Grid search over the number of epochs, batch size, learning rate, and weight decay was performed; however, none of the settings improved over the defaults.
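A condensed sketch of this fine-tuning setup with the Hugging Face transformers library; dataset handling and tokenization are omitted, the checkpoint name is one possible choice, and train_dataset/eval_dataset are assumed to be prepared Dataset objects.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-german-cased"  # one of the evaluated German base models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)

args = TrainingArguments(
    output_dir="pcd_model",
    num_train_epochs=5,              # settings reported above
    per_device_train_batch_size=32,
    learning_rate=5e-5,              # transformers defaults
    weight_decay=0.0,
)

# train_dataset / eval_dataset: tokenized datasets with
# "input_ids", "attention_mask", and "label" columns (assumed to exist).
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```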

Recall can be considered the most important metric for claim detection as it indicates how many of the relevant claims were actually found. However, high recall should not be achieved at the expense of precision, because that would mean the system also retrieves many irrelevant claims. In the present case, the distinction matters little because recall and precision are roughly equal in almost all experiments. I therefore mostly report \(\hbox {F}_{1}\) as the harmonic mean of precision and recall. All scores were obtained by averaging over 10-fold stratified cross-validation.
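For the classical models, such averaged scores can be obtained along these lines; a sketch in which X and y stand for the sentence representations and consensus labels, and the shuffling seed is an assumption.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    SVC(), X, y, cv=cv,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
)
print("F1:", scores["test_f1_macro"].mean())  # mean over the 10 folds
```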

Results

(Transformer) models show strong performance

Figure 2 displays the scores for the traditional machine learning models and for the different embeddings. The scores range from as low as 0.23 to 0.72. There is no unambiguously best performing model. SVM shows the best performance in combination with GloVe, Fasttext, and transformer embeddings but not with the other embedding techniques. Naive Bayes and Random Forest, by contrast, perform poorly with every embedding.

Fig. 2 Performance of classical machine learning models

The results for the transformer architectures can be found in Table 4. As mentioned before, the performance remains constant across different metrics. The best performing transformer models are Bert and Electra with scores of 0.91 for all available metrics. DistilBert follows closely with a difference of 0.01, while GottBert performed worst of all transformer models. Nevertheless, the scores of the transformers are consistently higher than those of the classical machine learning models.

Table 4 Performance of transformer models

Figure 3 displays the performance of DistilBert broken down by class label and document type; DistilBert was chosen because it performed strongly in the previous tests and requires the least computational resources of all the transformer models. The highest class confusion is between assertions and opinions. Performance across text types is balanced: there is no document type for which the scores are much lower than for the others, although the model performs somewhat weaker on sentences from manifestos and Twitter.

Fig. 3 Confusion matrix and performance by document type for DistilBert

Models adapt well to new domains

As mentioned in the introduction and discussed in the next section, one guiding assumption of this paper is that the reduction of class differences to grammatical differences (word order, punctuation, tense, etc.) leads to a strong generalization across domains. In order to test this hypothesis, several experiments were conducted on DistilBert.

For simulating new domains, the dataset was split according to different criteria (Table 5). To test generalization across topical domains, all sentences that dealt with the “Turn of Eras” speech by the German chancellor Olaf Scholz were separated and used as the test set, while the remaining sentences were used as the training set. The same was done for sentences about the German debate on mandatory vaccination during the COVID-19 pandemic and for sentences about the climate summit in Glasgow in 2021 (COP26). In order to test adaptation across text sources, the same procedure was conducted, but the splitting criterion was whether the sentences were drawn from parliament protocols, tweets, talkshow transcripts, newspaper articles, or party manifestos. It is expected that, depending on the source, the texts have a rather monologic/dialogic or formal/informal character. Finally, the dataset was split according to the time period of the sentences: some were drawn from the 20th century (1994–1998) and others from the 21st century (remainder). Note that for the sentences from the 20th century, data augmentation was performed due to scarcity.Footnote 11

Table 5 Domain adaptation for Distilbert
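The splitting logic reduces to holding out one domain at a time; a minimal sketch, assuming a pandas DataFrame with sentence, label, and metadata columns such as "topic" or "source" (the column and value names are hypothetical).

```python
import pandas as pd

def domain_split(df: pd.DataFrame, column: str, held_out: str):
    """Hold out one domain as the test set and train on everything else,
    e.g. domain_split(df, "source", "twitter") or domain_split(df, "topic", "cop26").
    The time-period split works analogously on a year column."""
    test = df[df[column] == held_out]
    train = df[df[column] != held_out]
    return train, test
```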

It can be observed that the performance remains high (\(\hbox {F}_{1}\) \(\ge\) 0.87) and roughly constant across the different domains. There are two exceptions. The model did not perform well on sentences from the 21st century. However, this is most likely because data from 1994 to 1998 is scarce and augmentation was only of limited help. The other exception is the relatively weak performance on Twitter data. However, the drop to an \(\hbox {F}_{1}\)-score of 0.83 is still moderate, given that the model did not see any social media data during training.

Classifications are based on the intended criteria

It is argued that the strong domain adaptation is due to PCD’s reduction of class differences to differences in grammar. The grammatical indicators for the different classes can be found in Table 2. For example, assertions often come in the form of quotes and are therefore marked by a colon and quotation marks. Opinions, on the other hand, often contain modal verbs like can, should, or must. In order to find out whether these indicators contribute to the classification by a PCD model, an analysis using Shapley additive explanation (SHAP) values was conducted [38].

SHAP values are a measure of feature importance. They assign importance to features of individual examples, which leads to local explanations. Since PCD is a sentence classification task, SHAP values indicate the importance of individual words or sub-words, depending on the embedding, for the classification of a given sentence into one of the four classes. For global explanations, i.e., explanations of the model’s general behavior, measures of central tendency, like the mean, can be applied to a sample of multiple examples. For the present study, SHAP values were computed for each sentence using 10-fold cross-validation.
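A sketch of this computation with the shap library and a transformers text-classification pipeline; the checkpoint path "pcd_model" and the example sentence are placeholders, and the exact setup used for the study may differ.

```python
import shap
from transformers import pipeline

# Fine-tuned PCD model loaded as a text-classification pipeline;
# "pcd_model" is a placeholder for the trained checkpoint.
clf = pipeline("text-classification", model="pcd_model",
               tokenizer="pcd_model", top_k=None)  # top_k=None returns scores for all classes

# shap infers a text masker from the pipeline's tokenizer.
explainer = shap.Explainer(clf)
shap_values = explainer(["Die Regierung wird mehr digitale Dienste anbieten."])

# Per-token contributions towards each of the four classes;
# averaging over many sentences yields the global picture in Fig. 4.
shap.plots.text(shap_values[0])
```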

Fig. 4 SHAP values for groups of class-relevant words

Figure 4 shows the most important words for a given class for DistilBert, measured as the mean SHAP value of the given word with respect to each of the four possible labels in the dataset. The top group shows the five most influential words for each class. For opinions these include many subjective words (sympathy, interesting, hopefully), which aligns with the expectations; for the other classes, however, it is difficult to make sense of the results. The next group includes conjugations of “to say” as well as colons and quotation marks, which indicate quotes and which are strongest for assertions. The following group displays the importance of modal verbs, which are strongest for opinions. The fourth group displays the class-wise mean SHAP values for “werden” (will) in different conjugations; since this is the auxiliary verb for the future tense, it is strongest for predictions. Finally, the class-wise mean SHAP values for punctuation marks are displayed. As other consists mainly of questions and exclamations, question and exclamation marks should be expected to be of strong importance for this class. This is the case; moreover, the period sign is very weak for other and strong for assertions. These results indicate that the model successfully picked up the criteria that were indicative of each class and that guided the human annotators. This might not come as a surprise, but it shows that the model did not learn spurious correlations.

Discussion

PCD is more neutral and open than its alternatives

Claim detection is the task of identifying claims that potentially carry misinformation. The practical goal is to provide a pre-selection of claims to fact-checkers in order to decrease their workload. Most approaches try to detect checkworthy claims, which is a notoriously vague term. In line with Konstantinovskiy et al. [9], PCD drops the notion of checkworthiness and focuses on claims to truth. However, PCD focuses on sentences with propositional content rather than on factual claims.

Propositional content is a broader notion than factuality. For example, a description of a personal experience like “I woke up this morning at 7 a.m.” qualifies as a sentence with propositional content but not as a factual sentence according to the definitions of the other accounts. This is intentional and based on the assumption that factuality depends on the context. It might not be relevant from the perspective of fact-checking or misinformation at what time an average citizen woke up, but it might be relevant if the sentence was uttered by a politician. In this sense, PCD is a radicalization of factual claim detection because it makes even fewer assumptions about what is relevant for fact-checking and focuses exclusively on the fact that misinforming (as well as informative) sentences must carry truth values.

PCD shares a limitation with factual claim detection: it lacks a criterion for prioritizing one claim over another. Fact-checkers are not interested in simply any claim. Checkworthiness adds a notion of importance to the task and orders the claims according to their relevance. It was argued above that checkworthiness is a problematic concept; nevertheless, dropping it altogether leaves a gap that must be filled, because the lack of a prioritization criterion limits the applicability of claim detection.

The British fact-checking organization Full Fact reports that applying the models of Konstantinovskiy et al. [9] to parliament debates, Facebook posts, and tweets yields roughly 80,000 claims per day.Footnote 12 Naturally, this is too much for manual analysis. For further filtering, they use heuristics: certain topics like sports or celebrities are dropped and not forwarded to the fact-checkers. In other words, focusing exclusively on claims to truth or factual claims is not enough. The models must be enriched by further selection criteria (a possible candidate is discussed in the last section). However, by focusing exclusively on the fact that misinforming claims are sentences that can be true or false, PCD remains neutral and compatible with such additions.

PCD shows strong performance across domains

It was shown that models can reach strong performance on the PCD dataset. Even though classical machine learning models only reach poor or mediocre scores, transformer models achieve \(\hbox {F}_{1}\)-scores of 0.9 and more. This is also true for lightweight transformer models like DistilBert. In comparison, the strongest performance on the Claimbuster dataset is 0.91 in \(\hbox {F}_{1}\) [11], 0.76 for factual claim detection by [18], and 0.83 for [9]. The present results are also on the same level as approaches to Dialogue Act classification, some of which reach an \(\hbox {F}_{1}\)-score of 0.91 [24]. Accordingly, the question of whether PCD can reach competitive results can be answered positively.

The second guiding question of this paper was whether PCD adapts well to new domains. It was assumed that PCD’s focus on grammatical information, such as tense, word order, or punctuation, enhances generalization across domains. For example, most approaches to checkworthy claim detection use data drawn from US presidential debates. It can be assumed that occurrences of checkworthy claims in this domain differ strongly from occurrences in other domains like COVID-19 or climate change with respect to their meaning and content. In contrast, grammatical cues like tense or word order are not affected by the topic. A declarative sentence is a declarative sentence no matter whether the topic is COVID-19 or climate change.

In order to test this assumption, several experiments on domain adaptation were performed, with positive results. This indicates that even though no training dataset can exhaustively cover all possible domains, models trained for PCD are still likely to adapt well to previously unseen domains. This is because PCD focuses on relatively stable features like punctuation or modal verbs, as the analysis of SHAP values confirmed.

Conclusion

This paper introduced Propositional Claim Detection (PCD) and a corresponding dataset. It further presented the results of extensive testing on this dataset. The two major limitations of PCD set the agenda for future research. First, it is a shortcoming that even though rhetorical questions can carry truth values, they are disregarded by PCD. Future research can build on the taxonomy of PCD and add rhetorical questions to it. Second, it was argued that checkworthiness is a problematic concept but that dropping it altogether leaves a gap that must be filled. One possible candidate is the concept of news values [39]. News values or newsworthiness [40] share similarities with checkworthiness but enjoy a stronger theoretical and empirical grounding. Instead of relying on an abstract understanding of “what is interesting to the general public”, scholars have identified various concrete factors like proximity, timeliness, or conflict that make an event newsworthy. Furthermore, studies have found that certain news values occur significantly more strongly in misinformation than in “real news” [41, 42]. This can inform the automated detection of misinformation. Finally, there has been research on automatically detecting various news values that achieves strong results [43,44,45]. Future research should focus on combining PCD and news value detection. The result could be a classifier that aims at checkworthy claims to truth but avoids the aforementioned criticism of checkworthiness.

Despite these limitations, PCD is a solid foundation for claim detection: It is backed by a transparent taxonomy, achieves strong results and adapts well to new domains.